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Abstract 
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This  paper  addresses  4be  nonlinear  least-squares  problem^minrejt-  )|/( x  where  /(/)  is 
a  vector  in  9?”'  whose  components  are  smooth  nonlinear  functions.  The  problem ‘‘arises  most 
often  in  data  fitting  applications.  Much  research  has  focused  on  the  development  of  specialized 

algorithms  that  attempt  to  exploit  the  structure  of  the  nonlinear  least-squares  objective.  We 

F 

survey  methods  developed  for  problems  in  which  sparsity  in  the  derivatives  of  jJ1  is  not  taken  into 
account  in  formulating  algorithms.  :  ■, .  —  \  '  ->  ->„/.<  -  /v  '>• 
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1.  Introduction 


This  paper  addresses  the  problem  of  minimizing  the  I 2  norm  of  a  multivariate  function  : 

mm  1 1|/{7 )|^, 

where  f(i)  is  a  vector  in  5fm  whose  components  are  real-valued  nonlinear  functions  with  con¬ 
tinuous  second  partial  derivatives.  We  shall  refer  to  the  function  5||/(*')|l2  as  the  nonlinear 
/ea.«f-sqtiare.«  objective  function.  An  alternative  formulation  of  the  problem  is  that  of  minimiz¬ 
ing  a  sum  of  squares  : 

5  £«')’• 

1  =  1 

where  each  d>i  is  a  real-valued  function  having  continuous  second  partial  derivatives. 

There  is  considerable  interest  in  the  nonlinear  least-squares  problem,  because  it  arises  in 
virtually  all  areas  of  quantitative  research  in  data-fitting  applications.  A  typical  instance  is  the 
choice  of  parameters  0  within  a  nonlinear  model  so  that  the  model  agrees  with  measured 
quantities  rf,  as  closely  as  possible: 

t=sl 

where  r,  are  prescibed  values.  Much  research  has  focused  on  the  development  of  specialized 
algorithms  that  attempt  to  exploit  the  structure  of  the  nonlinear  least-squares  objective.  Despite 
these  efTorls,  methods  do  not  perform  equally  well  on  all  problems,  and  it  is  generally  not  possible 
to  characterize  those  problems  on  which  a  particular  method  will  work  well. 

In  this  paper,  we  survey  existing  numerical  methods  for  dense  nonlinear  least-squares  prob¬ 
lems.  For  a  study  of  the  performance  of  widely-distributed  software  for  nonlinear  least-squares, 
see  Fraley  [1987a,  1988],  We  assume  a  knowledge  of  numerical  methods  for  linear  least-squares 
problems  (e.  g  ,  Lawson  and  Hanson  [1974],  and  Golub  and  Van  Loan  [1983]).  We  also  assume 
familiarity  with  Newton-based  linesearch  and  trust-region  methods  for  unconstrained  minimiza¬ 
tion  (e.  g  ,  Fletcher  [1980],  Gill,  Murray,  and  Wright  [1981],  Dennis  and  Schnabel  [1983],  and 
More  and  Sorensen  [1984]),  If  T  is  the  function  to  be  minimized,  recall  that  both  linesearch  and 
trust-region  methods  involve  iterative  minimization  of  a  quadratic  local  model 

Q(p)  =  v;T(x*)tp+^t//*p 

for  T(ik  +  p)  -  T{ti<),  the  change  in  T  at  the  current  iterate  r*.  In  linesearch  methods,  the 
vector  pj'-  defined  by 

vV  =  mi"  re*"C(p) 


is  used  as  a  search  direct  ion.  A  positive  step  is  taken  from  along  ;>J'  to  the  next  iterate, 


that  is. 


•r*  +  l  =  Tk  +  'UP*' 


where  the  steplength  n*  >  0  is  computed  by  approximate  minimization  of  the  function  $*(a)  = 
■7-(x*  +  n  />£' ) .  The  vector  ;r*e  must  be  a  descent  direction  for  T  at  x*  —  in  other  words, 
)Tp'kf  <0  —  so  that  T  initially  decreases  along  />£-'  from  x*.  Normally  //*  is  required 
to  be  positive  definite,  which  guarantees  that  the  quadratic  model  has  a  unique  minimum  that 
is  a  descent  direction.  In  trust-region  methods, 


-  _  _  i  _rn 

-  t  +  l  —  T*  +  P*  i 


where 


vV  ~  arg  min  reiT"  0(p)  subject,  to  ||p||  <  6*. 


The  rationale  for  restricting  the  size  of  p  in  the  subproblem  is  that  Q(p)  is  a  good  approximation 
to  T  only  at  points  close  to  x*. 

1.1  Definitions  and  Notation 

We  shall  use  the  following  definitions  and  noiational  conventions: 

•  Generally  subscripts  on  a  function  mean  that  the  function  is  evaluated  at  the  corresponding 
subscripted  variable  (for  example,  /*  =  /(x*)).  An  exception  is  made  for  the  residual 
functions  <t>,,  where  the  subscript  is  the  component  index  for  the  vector  /. 

•  /  -  The  vector  of  nonlinear  functions  whose  1 2  norm  is  to  be  minimized. 

The  nonlinear  least-squares  problem  is 

5/(r)T/(r)’ 

where  the  factor  is  introduced  in  order  to  avoid  a  factor  of  two  in  the  derivatives. 

•  4>i  -  The  »th  residual  function,  also  the  fth  component  of  the  vector  /. 


/Or)  3 
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An  alternative  formulation  of  the  nonlinear  least-squares  problem  is 


1  m 
min  -  Y 
ret?-  2  t—1 
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where  each  is  a  smooth  function  mapping  3i"  to  3?. 


•  J  -  The  ?n  x  v  Jacobian  matrix  of  /. 

>(j)  =  V/(r)  = 

m 

\T7T 

•  ^  -  The  gradient  of  the  nonlinear  least-squares  objective. 

5(7)  ^  V  Q  f{r)Tf(x)j  =  J(z)Tf(i) 

•  B  -  The  part  of  the  Hessian  matrix  of  the  nonlinear  least-squares  objective  that  involves 
second  derivatives  of  the  residual  functions.  We  have 

(j  /(*)T/(*))  =  ./(7)T./(r)  +  B(r), 

where 

m 

B(r)s'£4i(r)V34i(r). 

1=1 

•  7v  (A )- The  range  of  ,4. 

If  .4  is  an  VI  x  n  matrix,  then  V(A)  =  {6  6  |  At  =  6  for  some  7  €  3?"  }  is  a  subspace 

of  3r. 

•  A  (-4)  -  The  null  space  of  A. 

If  A  is  an  m  x  n  matrix,  then  A”(/4)  =  {z  €  5?"  j  A:  =  0}  is  a  subspace  of  5f".  Ar(.4)  is 
the  orthogonal  complement  of  V(AT)  in  3?". 

2.  Gauss-Newton  Methods 

The  classical  approach  to  nonlinear  least  squares,  called  the  Gauss-Newton  method,  is  a 
linesearch  method  in  which  the  search  direction  at  the  current  iterate  minimizes  the  quadratic 
function 

ffkP+^PTJl-hP'  (2-U 

The  function  (C.1 )  is  a  local  approximation  to  5  11/(7*  4  5  ||/(7*  )||2  m  which  each  residual 

component  of  /  is  approximated  by  a  linear  function,  using  the  relationship 

f{rk  +  p)  =  f(rk)  +  J(7*)p  4  0(|!/>||2). 
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As  a  model  for  the  change  in  the  least-squares  objective,  (2.1)  has  the  advantage  that  it  involves 
only  first  derivatives  of  the  residuals,  and  that  .7T.7  is  always  at  least  positive  semi-definite.  If 

7 =  nr8  m*"  r€S".9kTl’+  JpT-7*T^7>. 


then  j>£N  satisfies  the  equations 


JkJkP  =  ~9k. 


and  is  therefore  a  direction  of  descent  for  fj /*  whenever  g\,  £  0  —  as  required  in  a  linesearch 
method.  To  guarantee  convergence  to  a  local  minimum,  the  sequence  of  search  directions  must 
also  be  bounded  away  from  orthogonality  to  the  gradient,  a  condition  that  may  not  be  met  by 
successive  Gauss-Newton  directions  unless  the  eigenvalues  of  JTJ  are  bounded  away  from  zero. 
Powell  [1970]  gives  an  example  of  convergence  of  a  Gauss-Newton  method  with  exact  linesearch 
to  a  non-stationary  point. 

The  Gauss-Newton  method  can  be  viewed  as  a  modification  of  Newton's  method  in  which 
JrJ  is  used  to  approximate  the  Hessian  matrix 

rn 

JTJ  +  J>V2*,- =  JTJ  +  B 

i=i 

of  the  nor. linear  least-squares  objective  function.  The  assumption  is  that  the  matrix  JTJ  should 
be  a  good  approximation  to  the  full  Hessian  when  the  residuals  are  small.  In  fact,  if  /(z*)  =  0 
and  J(i')  is  positive  definite,  then  the  sequence  { t k  -+  P*N}  is  locally  quadratically 

convergent  to  7*.  because 

,7(7l)TJ(7t)  =  I  V2||/(7*)H2  4-  C>(||7*  -  7*|j). 

For  more  convergence  results  and  detailed  convergence  analysis  for  the  Gauss-Newton  method, 
see,  e.  g..  Chapter  10  of  Dennis  and  Schnabel  [1983],  Schaback  [1985],  and  Haussler  [1986],  as 
well  as  some  of  the  references  cited  below. 

McKeown  [1975a,  1975b]  studies  test  problems  of  the  form, 


/(x)  -  /o  +  f’o7  +  j 


7T  H\Z 


r  THmr 


chosen  so  that  factors  affecting  the  rate  of  convergence  could  be  controlled.  He  uses  three 
such  problems,  with  seven  different  values  of  a  parameter  that  varies  an  asymptotic  linear  con¬ 
vergence  factor.  The  algorithms  tested  include  some  quasi-Newton  methods  for  unconstrained 
optimization,  as  well  as  some  specialized  methods  for  nonlinear  least  squares  that  have  since 


been  superseded.  He  concludes  that,  when  the  asymptotic  convergent  factor  is  small,  the  Gauss- 
IMewton  method  is  more  efficient  than  the  quasi-Newton  methods,  but  that  the  opposite  is  true 
when  the  asympotic  convergence  factor  is  large.  Fraley  [1987a,  b;  1988]  gives  numerical  re¬ 
sults  for  some  Gauss-Newton  methods  using  these  problems,  and  observes  that  the  Jacobian  is 
well-conditioned  at  every  iteration. 

A  difficulty  with  the  Gauss-Newton  method  arises  when  JTJ  is  singular,  or,  equivalently, 
when  J  has  linearly  dependent  columns,  because  then  (2.1)  does  not  have  a  unique  minimizer. 
For  this  reason  the  Gauss-Newton  method  should  more  accurately  be  viewed  as  a  class  of  methods, 
each  member  being  distinguished  by  a  different  choice  of  ;>  when  JrJ  is  singular.  The  set  of 
vectors  that  minimize  (2.1)  is  the  same  as  the  set  of  solutions  to  the  linear  least-squares  problem 

min  ||J*p+M|2.  (2.3) 

p€»" 

One  (theoretically)  well-defined  alternative  that  is  often  approximated  computationally  is  to  re¬ 
quire  the  unique  solution  of  minimum  /2  norm  : 

nun||p|i2,  (2.4) 

where  5  is  the  set  of  solutions  to  (2.3).  Another  alternative  is  to  replace  J  in  (2.3)  by  a  maximal 
linearly  independent  subset  of  its  columns.  In  finite-precision  arithmetic,  there  is  often  some 
ambiguity  about  how  to  formulate  and  solve  an  alternative  to  (2.3)  when  the  columns  of  J  are 
"nearly"  linearly  dependent,  so  that,  from  a  computational  standpoint,  any  particular  Gauss- 
Newton  method  must  be  still  viewed  as  a  class  of  methods.  The  references  cited  above  for  linear 
least  squares  discuss  at  length  the  difficulties  inherent  in  computing  solutions  to  (2.3)  when  J 
is  ill-conditioned,  and  show  that  the  numerical  solution  of  these  problems  is  dependent  on  the 
criteria  used  to  estimate  the  rank  of  J.  From  now  on,  the  term  "Gauss-Newton  method”  will 
refer  to  any  linesearch  method  that  has  the  following  two  properties.  First,  the  nonlinear  least- 
squares  objective  is  used  as  a  merit  function  for  the  linesearch.  Second,  the  search  direction  is 
the  result  of  some  well-defined  computational  procedure  for  solving  (2.3). 

For  a  survey  of  some  of  the  early  research  on  numerical  Gauss-Newton  methods,  see  Dennis 
[1977]  More  recently,  Deuflhard  and  Apostolescu  [1980]  suggest  selecting  a  steplength  for  the 
Gauss  Newton  direction  based  on  decreasing  the  merit  function  ra,^er  than  ll/(?')ll2' 

for  a  class  of  nonlinear  least-squares  problems  that  includes  zero-residual  problems.  The  function 
Jl  is  the  pseudo- in  verse  of  .7*  (see,  e.  g.  Chapter  6  of  Golub  and  Van  Loan  [1983]);  j\fk  is 
another  way  of  representing  the  minimum  /2-norm  solution  to  \\-hp A||2  They  reason  that 
the  Gauss-Newton  direction  is  the  steepest-descent  direction  for  the  function  ||j//(x)||2,  so  that 
the  geometry  of  the  level  surfaces  defined  by  ||./jJ/(j,)||2  is  more  favorable  to  avoiding  small  steps 


in  the  linesearch.  A  shortcoming  of  this  approach  (pointed  out  by  the  authors)  is  that  there  are 
no  global  convergence  results.  The  merit  function  depends  on  /*,  so  that  a  different  function  is 
being  reduced  at  each  step  Another  difficulty  is  that,  although  the  authors  state  that  numerical 
experience  supports  selection  of  a  steplength  based  on  JjJ^  f(x  )||2  for  ill-conditioned  problems,  the 
transformation  .7^  is  not  numerically  well-defined  under  these  circumstances.  Therefore  neither 
the  Gauss-IMewton  search  direction,  nor  the  merit  function,  is  numerically  well-defined  when  the 
columns  of  .ft  are  nearly  linearly  dependent. 

There  is  another  reason  why  it  is  difficult  to  say  precisely  what  is  meant  by  a  "Gauss-Newton 
method"  for  a  particular  nonlinear  least-squares  problem.  To  see  this,  let  Q(x)  be  an  f  x  m 
orthogonal  matrix  function  on  9f",  that  is,  Q(x)r  Q(x)  =  1  for  all  x.  Then  \\Q{x)J(x)\\\  = 
||/(7)||?  for  all  /,  and  consequently  the  function  f  =  Qf  defines  the  same  nonlinear  least- 
squares  problem  as  /.  The  Jacobian  matrix  of  /  is  J  =  QJ  +  (^Q)f,  so  that  a  minimizer  of 
||  jp  +  /||2  will  ordinarily  be  different  from  a  minimizer  of  ||.7;>  +  /||2,  unless  Q(x)  happens  to 
be  a  constant  transformation.  However,  if  both  Q  and  f  have  k  continuous  derivatives,  then 

V'||Q(7)/(/)||^  =  V||/(7)||’  for  i  =  1.2 . k.  Letting  IT  =  (VQ)/,  so  that  J  =  QJ  +  IT, 

we  have 

Pj'  =  J1  J  +  (JtQt1T  -I-  ITtQJ)  +  ITt1V. 

showing  that  the  Gauss-Newton  approximation  JTJ  to  the  full  Hessian  matrix  is  changed  when 
/  is  transformed  by  an  orthogonal  function  that  varies  with  x.  Thus,  with  exact  arithmetic,  there 
are  many  Gauss-Newton  methods  corresponding  to  a  given  vector  function,  although  Newton's 
method  remains  invariant  (see  also  Nocedal  and  Overton  [1985],  p.  826).  In  fact,  each  step 
of  a  Gauss-Newton  method  could  be  defined  by  a  different  transformation  of  /.  Moreover,  the 
conditioning  of  J  may  be  very  different  from  that  of  J,  so  that,  for  example,  the  columns  of 
J  might  be  strongly  independent,  while  J  is  nearly  rank  deficient.  Since  the  number  of  rows 
in  Q  may  be  greater  than  n,  it  is  possible  to  imbed  the  given  nonlinear  least-squares  problem 
in  a  larger  one.  This  suggests  the  (so  far  unexplored)  idea  of  preconditioning  a  Gauss-Newton 
method  at  each  step  with  an  orthogonal  function. 

Although  it  is  known  that  Gauss-Newton  methods  do  not  work  well  under  all  circumstances, 
it  is  not  possible  to  say  anything  more  precise  about  the  method  when  considering  large  and 
varied  sets  of  test  problems  Gauss-Newton  methods  are  of  practical  interest  because  there  are 
many  instances  in  which  they  work  very  well  in  comparison  to  other  methods.  In  fact,  most 
successful  specialized  approaches  to  nonlinear  least-squares  problems  are  based  to  some  extent 
on  Gauss-Newton  methods  and  attempt  to  exploit  this  behavior  whenever  possible.  However,  it 


r. 


is  not  hard  to  find  cases  where  Gauss-Newton  methods  perform  poorly,  so  that  they  cannot  be 
successfully  applied  to  general  nonlinear  least-squares  problems  without  modification. 

Fraley  [1987a,  1988]  gives  numerical  results  for  a  large  set  of  test  problems  using  widely- 
distributed  software  for  unconstrained  optimization  and  nonlinear  least  squares.  She  also  includes 
some  Gauss-Newton  methods  that  use  LSSOL  [Gill  et  al.  (1986a)]  to  solve  the  linear  least-squares 
subproblem  (2.3).  Her  findings  confirm  that  Gauss-Newton  methods  are  often  among  the  best 
available  techniques  for  nonlinear  least  squares  —  especially  for  zero-residual  problems  —  but 
that  there  are  many  cases  in  which  they  fail  or  are  inefficient.  Detailed  examples  are  presented 
that  illustrate  some  of  the  difficulties  involved  in  characterizing  those  problems  on  which  Gauss- 
Newton  methods  will  or  will  not  work  well  (see  also  Fraley  [1987b]). 

Many  attempts  have  been  made  to  define  algorithms  that  depart  from  the  Gauss-Newton 
strategy  only  when  necessary.  Bard  [1970]  compares  some  Gauss-Newton-based  methods  with 
a  Levenberg-Marquardt  method  (Section  3)  and  some  quasi-Newton  methods  for  unconstrained 
optimization  on  a  set  of  ten  test  problems  from  nonlinear  parameter  estimation.  He  uses  the 
eigenvalue  decomposition  of  JT J  to  solve  the  normal  equations  (2.2).  In  order  to  ensure  a 
positive-definite  system,  he  modifies  the  eigenvalues  if  their  magnitude  fat's  below  a  certain 
threshold.  In  addition,  his  implementations  include  bounds  on  the  variables  that  are  enforced  by 
adding  a  penalty  term  to  the  objective  function.  He  finds  that  the  Gauss-Newton-based  methods 
are  more  efficient  in  terms  of  function  and  derivative  evaluations  than  the  quasi-Newton  methods, 
but  that  there  is  no  significant  difference  in  the  relative  performance  of  the  Gauss-Newton-based 
methods  and  the  Levenberg-Marquardt  method. 

Betts  [1976]  proposes  an  algorithm  that  combines  a  Gauss-Newton  method  with  a  method 
in  which  the  Gauss-Newton  approximate  Hessian  JTJ  »  augmented  by  a  quasi-Newton  approx¬ 
imation  to  the  second-order  term  B  =  </>,V2d>;  in  the  nonlinear  least-squares  Hessian  (see 

Section  5).  The  algorithm  starts  with  a  Gauss-Newton  method,  and  then  switches  to  the  aug¬ 
mented  Hessian  when  it  is  believed  that  the  iterates  are  near  the  solution.  The  criterion  for  the 
switch  is 

IIp*I!2  <  f  0  +  ll^llj)-  (2.5) 

for  some  r  <  1.  Results  are  presented  for  the  hybrid  methods,  as  well  as  for  the  underlying 
Gauss-Newton  method  and  special  quasi-Newton  method  (see  Section  5),  on  a  set  of  eleven 
test  problems.  Betts  concludes  that  the  hybrid  method  is  superior,  especially  on  problems  with 
nonzero  residuals,  although  the  results  he  lists  in  his  tables  do  not  all  have  the  same  value  of  f 
in  (2. 8).  Another  issue  that  is  not  clarified  is  the  treatment  of  near-singularity  or  indefiniteness 
in  the  quadratic  model  in  any  of  the  methods  tested.  Also  the  test  (2.-‘>)  may  not  necessarily 
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imply  that  the  Gauss-Newton  iterates  are  in  the  vicinity  of  a  solution,  and  could  instead  indicate 
inefficiency  in  the  Gauss-Newton  method  at  some  arbitrary  point. 


Ramsin  and  Wedin  [1977]  compare  the  performance  of  a  Gauss-Newton-based  method  with 
that  of  a  Levenberg-Marquardt  method  for  nonlinear  least  squares  and  a  quasi-Newton  method  for 
unconstrained  optimization,  both  from  the  Harwell  Library.  The  quasi-Newton  routine  required 
an  initial  estimate  I! o  of  the  Hessian  matrix,  and  the  choice  //o  =  «f(^o)T^(7o)  was  rnade  on 
the  basis  of  preliminary  tests  that  showed  equal  or  better  performance  compared  to  Ho  =  /. 
The  test  problems  were  constructed  so  that  asymptotic  properties  could  be  monitored  and  are 
similar  to  those  of  McKeown  [1975a,  1975b]  mentioned  above.  In  all  cases  considered,  the 
Jacobian  matrix  had  full  column  rank  at  the  solution.  The  algorithm  of  Ramsin  and  Wedin  uses 
the  steepest-descent  direction,  rather  than  the  Gauss-I.ewton  direction,  whenever  the  decrease 
in  the  objective  is  considered  unacceptably  small.  The  experiments  involved  variation  of  a  large 
number  of  parameters.  Ramsin  and  Wedin  conclude  that  their  Gauss-Newton-based  method  and 
the  Levenberg-Marquardt  method  are  identical  when  the  asymptotic  convergence  factor  is  small, 
but  that  neither  method  is  consistently  better  for  large  asymptotic  convergence  factors.  Also, 
they  find  that  in  instances  when  the  asymptotic  convergence  factor  is  large,  the  quasi-Newton 
method  may  be  more  efficient,  although  superlinear  convergence  of  the  quasi-Newton  method 
was  never  observed.  Ramsin  and  Wedin  maintain  that  Gauss-Newton  should  not  be  used  when 
(»)  the  current  iteiate  xk  is  close  to  the  solution  /*,  and  the  relative  decrease  in  the  size  of 
the  gradient  is  small,  (fr)  is  not  near  r" ,  and  the  decrease  in  the  sum  of  squares  relative  to 
the  size  of  the  gradient  is  small,  or  (m)  Ji,  is  nearly  rank-deficient.  Conditions  (f)  and  (>i)  are 
indicators  of  inefficiency  for  any  minimization  algorithm;  in  general  the  problem  of  ascertaining 
the  closeness  of  an  iterate  to  a  minimum  is  as  difficult  as  solving  the  original  problem.  As  for 
condition  (in),  rapidly  convergent  Gauss-Newton  methods  may  exist  even  if  nearly  rank-deficient 
Jacobians  are  encountered,  but  that  it  appears  difficult  to  formulate  a  single  rule  for  estimating 
the  rank  of  the  Jacobian  that  is  satisfactory  for  all  such  problems  (see  Fraley  [1987a,  b]). 

Wedin  and  Lindstrom  [1987]  have  developed  a  hybrid  algorithm  for  nonlinear  least-squares 
that  combines  a  Gauss-Newton  method  with  a  finite-difference  Newton  method.  A  Gauss-Newton 
search  direction  is  computed  at  every  iteration  using  a  QR  factorization.  The  rank  estimation 
criteria  are  complicated,  and  search  directions  for  several  different  estimates  of  the  rank  of  J 
may  be  tried  before  a  step  is  actually  taken.  A  finite  difference  Newton  step  may  be  used  when 
the  steps  along  Gauss-Newton  directions  become  small,  and  the  iterates  are  judged  to  be  close 
to  a  solution  In  the  algorithm,  the  decision  about  whether  the  iterates  are  close  to  the  solution 
is  based  on  the  relation  ||/;.||,  >  for  some  ")  >  1 ,  where  ih  is  the  norm  of  the  projection 
of  ft,  onto  the  range  of  -h  The  ratio  is  used  as  an  estimate  of  an  asymptotic  linear 
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convergence  factor.  They  give  numerical  results  for  a  set  of  thirty  large-residual  test  problems 
constructed  by  Al-Baali  and  Fletcher  [1985],  and  compare  their  results  with  those  given  by  Al- 
Baali  and  Fletcher  for  two  hybrid  Gauss-Newton/BFGS  methods  (Section  9)  and  a  version  of 
NL2S0L  (Section  5).  Wedin  and  Lindstrom  find  that  their  method  gives  better  overall  results 
than  the  other  methods,  although  their  method  does  fail  in  three  cases  due  to  a  finite-difference 
Hessian  that  is  not  positive  definite. 

3.  Levenberg-Marquardt  Methods 

In  Levenberg-Marquardt  methoas,  the  Gauss-Newton  quadratic  model  (2.1)  is  minimized 
subject  to  a  trust-region  constraint.  The  step  p  between  successive  iterates  solves 

min  qTp  -I-  -  pT JTJp  (3.1) 

re*"'  2 

subject  to  ||Dj>||2  <  fi. 

for  some  fi  >  0  and  some  diagonal  scaling  matrix  D  with  positive  diagonal  entries.  Equivalently, 
p  minimizes  the  quadratic  model 

gTr+1-p'T(JTJ  +  \D'TD)r,  (3.2) 

for  some  A  >  0.  Since  the  matrix  Jr  J  +  A DT D  is  positive  semidefinite,  minimizers  p\  of  (3.2) 
satisfy  the  equations 

{JTJ  +\DTD)p  =  -g  =  -JTf,  (33) 

which  are  the  normal  equations  for  the  linear  least-squares  problem 

(o)L'  {3A) 

Hence  a  regularization  method  (e.  g  ,  Chapter  25  of  Lawson  and  Hanson  [1974],  Elden  [1977, 
1984],  Varah  [1979],  and  Gander  [1981])  is  being  used  to  solve  the  linear  least-squares  problem 
(2.3)  for  the  step  to  the  next  iterate. 

The  paper  by  Levenberg  [1944]  is  the  earliest  known  reference  to  methods  of  this  type 
Based  on  the  observation  that  the  unit  Gauss-Newton  step  pGN  often  fails  to  reduce  the  sum  of 
squares  when  [|7>nAl.  |(  is  not  especially  small,  he  suggests  limiting  the  size  of  the  search  direction 
by  solvirg  a  “damped"  least-squares  subproblem, 

min  v(grp  +  ^  pTJTJp)  -I-  \\Dp\\\. 


(3.5) 


in  which  a  weighted  sum  of  squares  of  linearized  residuals  and  components  of  the  search  direction 
is  minimized  He  proves  the  existence  of  a  value  of  ui  for  which 

ll/('  +  MII2  <(!/(' )||2. 

where  }\.  solves  (3.5),  thus  ensuring  a  reduction  in  the  sum  of  squares  for  a  suitable  value  of  w. 
A  major  drawback  is  that  no  automatic  procedure  is  given  for  obtaining  u>.  Levenberg  suggests 
computing  the  value  of  J]/(z  +  ;>u  )||2  for  several  trial  values  of  u>,  locating  an  approximate 
minimum  graphically,  and  then  repeating  this  procedure  with  the  improved  estimates  until  a 
satisfactory  value  of  ur  is  obtained,  but  precise  criteria  for  accepting  a  trial  value  are  not  given. 
Two  alternatives  are  proposed  for  the  diagonal  scaling  matrix  D  in  (3.5)  D  =  I,  because  it 
minimizes  the  directional  derivative  gJ for  u>  =  0,  and  the  square  root  of  the  diagonal  of 
JT J ,  based  on  empirical  observations.  The  claim  is  that  the  new  method  solves  a  wider  class  of 
problems  than  methods  that  existed  at  that  time,  and  that  it  does  so  with  relative  efficiency. 

Somewhat  later,  a  similar  method  was  (apparently  independently)  proposed.  Morrison  [1960] 
considers  a  quadratic  model 

PTP  +  ^  (3-6) 

in  which  either  H  =  JTJ  or  H  =  5T2  (frf)  (in  the  latter  case,  it  is  implicitly  assumed 
that  V-  (fT f)  is  positive  semidefinite).  He  advocates  minimizing  (3.6)  over  a  neighborhood 
of  the  current  point  as  does  Levenberg,  because  (3.6)  may  not  be  a  good  approximation  to 
5  (ll/U  +  J')ll2  -  l!/(^)|l2)  minimizer  p*  is  large  in  magnitude,  and  consequently  the  sum 

of  squares  may  not  be  reduced  at  i  +  p* .  (In  Hartley  [1961],  a  linesearch  is  used  with  the 
Gauss-Newton  direction  for  the  same  reason.)  Morrison  proves  that  the  solution  p\  to 

min  (?Tp  4-  -  pT(  U  +  \D)p 
re*'"  2 

for  A  >  0  is  the  constrained  minimum  of  (3.G)  on  the  sphere  of  radius  ||J?p»||2.  and  that 
||pi||2  —  0  as  A  oo.  In  Morrison's  method,  the  step  bound  h  is  the  independent  parameter, 
rather  than  A.  No  specifications  are  given  for  either  h  or  D,  although  it  is  implied  that  they  can 
be  chosen  heuristically  for  a  given  problem  Instead  of  minimizing  (3.6)  subject  to  ||Z?p||2  <  6, 
constraints  of  the  form  | d,r,\  <  fi  are  imposed,  and  the  resulting  subproblem  is  then  solved  using 
the  eigenvalue  decomposition  of  //,  Although  the  theory  and  methods  apply  for  any  positive 
semi-definite  H  in  (3.6),  no  generalization  to  unconstrained  minimization  is  mentioned. 

Marquardt  [1963]  extended  Morrison's  work,  showing  that  the  vector  p\  that  solves  (3.3) 
becomes  parallel  to  the  steepest-descent  direction  as  A  —  oc,  so  that  f\  interpolates  between 
the  Gauss-Newton  search  direction  ;>0.  and  the  steepest-descent  direction,  p^  .  He  points  out 
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that  the  method  determines  both  the  direction  from  the  current  iterate  to  the  next  one,  and 
the  distance  between  the  iterates  along  that  direction,  and  that  increasing  A  decreases  the  step 
length,  while  shifting  the  direction  away  from  orthogonality  to  the  gradient  of  the  sum  of  squares 
Marquardt's  strategy  controls  A  automatically  by  multiplying  or  dividing  the  current  value  by  a 
constant  factor  v  greater  than  1.  He  maintains  that  the  minimum  of  the  Gauss-Newton  model 
should  be  taken  over  the  largest  possible  neighborhood,  that  is,  that  A  should  be  chosen  as  small 
as  possible,  so  as  to  achieve  faster  convergence  by  biasing  the  search  direction  toward  the  Gauss- 
Newton  direction  when  Gauss-Newton  methods  would  work  well.  Thus,  at  the  Jfcth  iteration. 
Xk  =  Xk_i/i>  is  tried  first,  and  then  increased  if  necessary  by  multiples  of  v  until  a  reduction 
in  the  sum  of  squares  is  obtained.  A  shortcoming  of  this  scheme  is  that  A  is  always  positive,  so 
that  the  constraint  in  (3.1)  is  active  in  every  subproblem,  and  consequently  a  full  Gauss-Newton 
step  can  never  be  taken.  Also,  no  efficient  method  is  given  for  solving  (3.3)  for  different  values 
of  A.  Motivated  by  statistical  considerations,  Marquardt  uses  the  diagonal  of  J1  J  for  the  scaling 
matrix  D  (one  of  the  alternatives  proposed  by  Levenberg),  and  mentions  that  this  scaling  has 
been  widely  used  as  a  technique  for  computing  solutions  to  ill-conditioned  linear  least-squares 
problems. 

Since  the  appearance  of  Marquardt’s  paper,  and  also  that  of  Goldfeld,  Quandt,  and  Trotter 
[1966],  which  independently  proposed  trust-region  methods  for  general  unconstrained  optimiza¬ 
tion,  much  research  has  been  directed  toward  improvements  within  the  framework  presented 
there.  Bard  [1970]  takes  the  eigenvalue  decompostion  of  JrJ  at  each  iteration,  so  that  (3.3) 
can  be  easily  solved  for  several  values  of  A,  and  so  that  it  will  be  known  whether  or  not 
is  singular.  Bartels,  Golub,  and  Saunders  [1970]  show  how  to  use  the  S\  D  of  J  instead  of 
the  eigenvalue  decomposition  for  the  same  purpose.  They  also  give  an  algorithm  for  computing 
A  given  b  that  involves  determining  some  eigenvalues  of  a  diagonal  matrix  after  a  symmetric 
rank-one  update.  Meyer  [1970]  discusses  the  use  of  a  linesearch  with  Marquardt's  method  (see 
also  Osborne  [1972])  Shanno  [1970]  selects  A  so  that  p\  is  a  descent  step  for  ||/(t)||2-  The 
value  A  =  0  is  tried  first,  and  then  increases  are  made  by  multiplying  a  threshold  value  by  a 
factor  greater  than  one  until  *'■'( A)  <  0,  where  t(,(A)  =  ||/(z  +  jn)||2.  In  addition,  a  linesearch  is 
also  used  when  cof(px.g)  is  above  a  threshold  value,  that  is,  when  p\  is  judged  to  be  nearly  in 
the  direction  of  -g.  Shanno’s  method  is  meant  for  general  unconstrained  or  linearly-constrained 
minimization,  as  well  as  for  nonlinear  least  squares. 

Several  methods  have  attempted  to  approximate  Levenberg-Marquardt  directions  by  a  vector 
that  is  the  sum  of  a  component  in  the  steepest  descent  direction,  and  a  component  in  the  Gauss- 
Newton  direction  p CN.  Jones  [1970]  combines  searches  along  a  spiral  arc  connecting  pc„  and 
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the  origin  wvith  parabolic  interpolation  in  order  to  obtain  a  decrease  in  the  sum  of  squares  If  a 
reduction  is  not  achieved  after  trying  several  arcs,  then  the  steepest  descent  direction  is  searched. 
The  method  of  Powell  [1970a]  for  nonlinear  equations  and  [1970b]  for  unconstrained  optimization 
searches  along  a  piecewise  linear  curve.  The  algorithm  for  unconstrained  optimization  requires 
some  agreement  between  the  reduction  predicted  by  the  quadratic  model  and  the  actual  reduction 
in  the  sum  of  squares  before  the  step  is  accepted.  Global  convergence  results  that  include  use  of 
the  quadratic  model  (2.1)  for  nonlinear  least  squares  are  given  in  Powell  [1975]  (see  also  More 
[1983]).  Steen  and  Byrne  [1973]  approximate  a  search  along  an  arc  that  intersects  g  at  a  nonzero 
point.  Their  algorithm  requires  that  ./T  J  be  scaled  so  that  its  smallest  eigenvalue  is  2,  which  they 
accomplish  by  computing  and  finding  either  j|(^T>^)_1||1  or  IK-77-7)  I*-  A 

of  unspecified  small  magnitude  is  added  to  JTJ  in  the  event  of  singularity.  A  difficulty  with  any 
algorithm  based  on  this  type  of  approach  is  that  it  is  not  clear  how  to  define  the  approximation 
when  the  Gauss-Newton  direction  is  not  numerically  well  defined. 

Fletcher  [1971]  implements  a  modified  version  of  Marquardt's  algorithm,  in  which  adjust¬ 
ments  in  the  parameter  A  are  made  on  the  basis  of  a  comparison  of  the  actual  reduction  in  the 
sum  of  squares 

\  (ll/(*  +  P>)llj-  ll/<*)ll’),  (3-7) 

with  the  reduction 

9rrx  +  ^pJJTJpx  (3-8) 

predicted  by  the  model  (3.2),  which  is  the  optimum  value  of  the  objective  in  (3.1)  (see  also 
Powell  [1970b]).  The  step  px  is  taken  only  when  there  is  sufficient  agreement  between  (3.7)  and 
(3.8),  instead  of  accepting  px  whenever  the  trial  step  results  in  a  reduction  in  the  sum  of  squares. 
Fletcher  also  introduces  more  complicated  techniques  for  updating  A.  The  scheme  for  decreasing 
A  differs  from  that  given  by  Marquardt  in  that  division  by  a  constant  factor  is  used  only  until  A 
reaches  a  threshold  value,  Ac,  below  which  it  is  replaced  by  zero.  This  modification  is  motivated 
by  a  desire  to  allow  the  Gauss-Newton  step  (A  =  0)  when  Gauss-Newton  methods  would  work 
well,  since  A  is  always  positive  in  Marquardt's  method,  and  to  allow  the  initial  choice  of  A  =  0 
rather  than  some  arbitrary  positive  value.  Because  numerical  experiments  show  that  multiplying 
by  a  fixed  constant  factor  may  be  inefficient,  Fletcher  uses  safeguarded  quadratic  interpolation 
to  increase  A  when  (3.7)  and  (3.8)  differ  substantially.  If  the  current  value  of  A  is  nonzero,  then 
it  is  divided  by  a  factor 


0.1. 
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where  o m * „  is  'he  minimum  of  the  quadratic  interpolant  to  the  function  cHo )  =  ||/( x  +  np)||j 
at  <4(0).  4>'( 0),  and  d>(l).  There  is  also  a  provision  to  increase  A  =  0  to  the  threshold  value  Af 
under  certain  circumstances.  The  choice  of  Ac  appears  to  be  a  major  difficulty. 

Fletcher  gives  some  theoretical  justification  for  choosing  Ar  to  be  the  reciprocal  of  the 
smallest  eigenvalue  of  Since  he  chooses  to  solve  (3.3)  directly  for  each  value  of 

A  via  the  Cholesky  factorization,  rather  than  compute  the  eigenvalue  decomposition  of  JTJ 
or  the  singular  values  of  J,  the  minimum  eigenvalue  of  JTJ  is  not  available  without  further 
computation.  He  therefore  updates  the  estimate  of  Ac  only  when  A  is  increased  from  0,  calculating 
(.7tJ)_1  from  the  Cholesky  factorization  of  JT J,  and  then  takes  either  Ac  =  1/  |j(7TJ)-1 1| ^ , 
or  Ac  =  l/iracc  ((JTJ)-1).  A  drawback  is  that  Ac  is  not  defined  when  JT  J  is  singular,  and  it 
is  not  well  defined  when  JTJ  is  ill-conditioned.  Harwell  subroutine  VA07A  is  an  implementation 
of  Fletcher’s  method.  It  allows  the  user  to  select  the  scaling  matrix  D,  which  then  remains  fixed 
throughout  the  computation.  The  default  for  the  scaling  matrix  is  the  square  root  of  the  diagonal 
of  JTJ  at  the  starting  value. 

An  efficient  and  stable  method  for  solving  (3.3)  for  several  values  of  A  based  on  the  linear 
least-squares  formulation  (3.4)  is  given  by  Osborne  [1972].  The  method  is  accomplished  in  two 
stages  First,  the  QR  factorization  of  J  is  computed,  to  obtain 

(o  =  (-V,0) 

after  which  a  series  of  elementary  orthogonal  transformations  are  applied  to  reduce  the  right-hand 
side  of  (3.10)  to  triangular  form.  Thus  it  is  only  necessary  to  repeat  the  second  stage  of  this 
procedure  when  the  value  of  A  is  changed,  provided  the  QR  factorization  of  J  is  saved.  In  a  later 
paper,  Osborne  [1976]  discusses  a  variant  of  Marquardt’s  algorithm  for  which  he  proves  global 
convergence  to  a  stationary  point  of  /T/  under  the  assumption  that  the  sequence  {A*}  remains 
bounded.  In  this  method,  he  uses  a  simple  scheme  similar  to  the  one  proposed  by  Marquardt 
to  update  A,  but  controls  adjustments  in  A  by  comparing  (3.7)  and  (3.8).  His  implementation 
takes  D  to  be  the  square  root  of  the  diagonal  of  J7  J,  as  in  Marquardt's  method. 

The  algorithm  of  More  [1978]  adjusts  the  step  bound  b  in  (3.1)  rather  than  A,  a  strategy 
used  in  trust-region  methods  for  unconstrained  optimization  (see  More  [1983]  for  a  survey). 
Changes  in  6  depend  on  agreement  between  (3.7)  and  (3.8);  increases  are  accomplished  by 
taking  <5*  +  I  =  2||Z?i7>t|!2,  while  b  is  decreased  by  multiplying  by  the  factor  7  defined  by  (3.9). 
In  order  to  obtain  A  when  the  bound  in  (3.1)  is  active,  the  nonlinear  equation 

*(  A)  =  \\Dpx\\3-6  =  |(./TJ  +  \DrD)~l  g\\2-b  =0  (3.11) 
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is  approximately  solved  by  truncating  a  safeguarded  Newton  method  based  on  the  work  of  Hebden 
[1973]  (see  also  Reinsch  [1971]).  More  reports  that,  on  the  average,  (3.11)  is  solved  fewer  than 
two  times  per  iteration.  Also,  he  proves  global  convergence  to  a  stationary  point  of  /T/.  without 
assuming  boundedness  for  {At}.  Many  computational  details  are  given,  including  an  efficient 
method  for  calculating  the  derivative  of  'J’(A)  in  (3.11)  that  uses  the  QR  factorization  of  J. 
A  modification  of  the  two-stage  factorization  described  in  Osborne  [1972]  that  allows  column 
pivoting  is  used  to  solve  (3.3).  Subroutine  LMDER  in  MINPACK  [More,  Garbow,  and  Hillstrom 
(1980)]  is  an  implementation  of  the  method.  Variables  are  scaled  internally  in  LMDER  according 
to  the  following  scheme:  the  initial  scaling  matrix  Dq  is  the  square  root  of  the  diagonal  of  Jr  J 
evaluated  at  t0,  and  the  rth  diagonal  element  of  D*  is  taken  to  be  the  maximum  of  the  rth 
diagonal  element  of  Di,-\  and  the  square  root  of  the  rth  diagonal  element  of  JTJ.  Numerical 
results  are  presented  indicating  that  this  scaling  compares  favorably  with  those  used  by  Fletcher, 
and  by  Marquardt  and  Osborne.  The  user  also  has  the  option  of  providing  an  initial  diagonal 
scaling  matrix  that  is  retained  throughout  the  computation. 

Nazareth  [1980,  1983]  describes  a  hybrid  method  that  combines  a  Levenberg-Marquardt 
method  with  a  quasi-Newton  approximation  Hk  to  the  full  Hessian.  The  search  directions  solve 
a  system  of  the  form 

(0kjJ Jk  +  (1  -  0*)/7*  +  A kDk  Dk)  V  —  -gk, 

with  fit  €  [0.  1]  and  A*  >  0.  He  compares  the  reduction  in  the  sum  of  squares  predicted  by  both 
the  Levenberg-Marquardt  and  quasi-Newton  models  with  the  actual  reduction,  and  then  chooses 
9k  on  the  basis  of  this  comparison.  In  Nazareth  [1983],  a  simple  version  of  the  hybrid  strategy  is 
implemented  that  uses  Davidon's  optimally  conditioned  update,  with  Dk  =  / ,  and  a  variation  of 
Fletcher's  [1971]  method  for  updating  A.  Results  are  reported  for  a  set  of  eleven  test  problems  — 
including  five  problems  with  nonzero  residuals  —  and  compared  to  the  use  of  the  algorithm  as  a 
quasi-Newton  method  (9k  =  0)  or  a  Levenberg-Marquardt  method  (9k  =  0-  He  concludes  that 
the  hybrid  method  is  somewhat  better  for  the  problems  with  nonzero  residuals,  and  recommends 
development  of  a  more  sophisticated  implementation. 

4.  Corrected  Gauss-Newton  Methods 

Gill  and  Murray  [1976]  propose  a  linesearch  algorithm  that  divides  9?"  into  complementary 
subspaces  V  and  .V\  where  7?  C  7J(./T),  and  .V  is  nearly  orthogonal  to  7v(JT).  The  search 
direction  is  the  sum  of  a  Gauss-Newton  direction  in  7?,  and  a  projected  Newton  direction  in 
A\  This  strategy  avoids  a  shortcoming  of  Gauss-Newton  methods  —  that  components  of  the 
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search  direction  that  are  nearly  orthogonal  to  7\(JT)  may  not  be  well  determined  when  J  is 
ill-conditioned  —  because  each  component  is  computed  from  a  reasonably  well-conditioned  sub¬ 
problem.  The  vector  7  —  7*  may  become  almost  entirely  in  7?(JT)  in  a  Gauss-Newton  method, 
yet  the  algorithm  computes  a  search  direction  that  is  virtually  orthogonal  to  7?(./T)  due  to  ill 
conditioning  in  the  Jacobian  (see  Fraiey  (1987b]).  Gill  and  Murray  show  that  both  Gauss-Newton 
algorithms  defined  by  (2.4)  and  Levenberg-Marquardt  algorithms  generate  search  directions  that 
lie  in  7 \(JT),  while  the  Newton  search  direction  generally  will  have  a  component  in  A  '{J),  the 
orthogonal  complement  of  7v(JT),  whenever  J  has  linearly  dependent  columns.  For  problems 
with  small  residuals,  they  point  out  that  JTJ  is  a  reasonable  approximation  to  the  full  Hessian 
in  but  not  in  A^J).  Thus,  in  situations  where  x  —  7*  is  orthogonal  to  7?(JT),  and  J 

is  well-conditioned  but  has  linearly  dependent  columns  (for  example,  when  m  <  n),  the  Gauss- 
Newton  and  Levenberg-Marquardt  directions  have  no  component  in  the  direction  of  7  —  7*,  while 
Newton's  method  and  also  the  method  of  Gill  and  Murray  would  have  components  in  both  7?(  JT) 
and  A  "(J). 

The  basic  idea  of  the  method  is  as  follows.  Suppose  that 

J  =  QTVr  (4.1) 

is  an  orthogonal  factorization  of  J,  in  which  T  is  triangular  with  diagonal  elements  in  decreas¬ 
ing  order  of  magnitude  (either  a  QR  factorization  with  column  pivoting  or  the  singular-value 
decomposition).  Let 

r  =  ( y  Z)  (4.2) 

be  a  partition  of  V'  into  the  first  gradc(J)  columns  and  the  remaining  rr  —  gradr(J)  columns. 
The  columns  of  }’  form  an  orthonormai  basis  for  7J,  and  those  of  Z  form  an  orthonormal  basis 
for  A  .  The  Newton  search  direction  for  the  nonlinear  least-squares  problem  is  given  by 

(JTJ  +  B)r=  - Jrf . 


or,  equivalently. 


Vr(JTJ  +  B)p~  - VTJTf , 


since  V  is  nonsingular.  Using  (4.2),  equation  (4.3)  can  be  split  into  two  equations: 


YT(JrJ  +  B)p=  -YTJrf, 


>  »*■  »* 


ZT(JTJ+  B)t>=  -ZTJTf.  (4.5) 

Substituting  ;>  =  Vp,  +  Z/>,  into  (4.4)  yields 

YTJTJYPr  +  YTJTJZpe  +  VTfl/>  =  -V  TJT/. 

Since  gradr(.l)  is  chosen  to  approximate  rank(J),  ||JZ||  is  presumed  to  be  zero,  so  that 
YJ  JT  J  Zpz  vanishes.  Also,  for  zero  residual  problems,  the  term  YTBp  would  be  small  near 
a  minimum  relative  to  Yr  JT JY j),  ,  since  ||B||  approaches  zero.  Defining  f  to  be  ||j  -  x*||, 
where  x*  is  a  minimum  at  which  the  residuals  are  zero,  and  assuming  ||/||  =  G(r)  we  have 

YTJTJYpy  =  <P(0;  YTBp  =  G(f2);  YTJTf  =  0((). 

The  range-space  component  of  the  search  direction  is  therefore  chosen  to  satisfy 

Y1  JrJYpy  =  -YTJTf.  (4.6) 

With  grndr(.l)  =  Tank(J),  the  vector  Ypy  is  the  minimal  /2-norm  least-squares  solution  to 
Jp  ss  -/,  and  is  therefore  a  Gauss-Newton  direction.  For  the  null-space  portion,  since  JZ  =  0 
is  assumed,  (4.6)  reduces  to 

Zr  Bp  =  0, 

which  may  be  solved  for  Zpz  given  Vjr,  from  (4.5)  using 

ZTBZp2  =  -ZTBYpy.  (4.7) 

When  exact  second  derivatives  are  not  available,  the  use  of  finite  difference  approximations  along 
the  columns  of  Z  is  suggested. 

A  version  of  this  algorithm  called  the  corrected  Gauss-Newton  method  (Gill  and  Murray 
(1978)]  forms  the  basis  for  the  nonlinear  least-squares  software  currently  in  the  NAG  Library 
[1984],  It  uses  the  singular-value  decomposition  of  J,  rather  than  a  QR  factorization.  Rules 
based  on  the  relative  size  of  the  singular  values  are  given  for  choosing  an  integer  gradc(J)  to 
approximate  rnnk(.l).  and  an  attempt  is  made  to  group  together  singular  values  that  are  similar 
in  magnitude.  The  method  is  not  as  sensitive  to  grndr(J)  as  Gauss-Newton  is  to  rank  estimation, 
both  because  of  the  division  of  the  computation  of  the  search  direction  into  separate  components 
in  7?  and  A\  and  because  grndc(J)  is  varied  adaptively  based  on  a  measure  of  the  progress  of 
the  minimization  Moreover,  the  rate  of  convergence  is  potentially  faster  than  Gauss-Newton 
or  Levenberg-Marquardt  methods  on  problems  with  nonzero  residuals.  The  quantity  grade(J) 
is  reduced  when  the  sum  of  squares  is  not  adequately  decreasing,  so  that  there  is  the  potential 
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of  having  .V"  =  5?"  (with  exact  second  derivatives,  this  implies  taking  full  Newton  steps)  in  the 
vicinity  of  a  solution.  The  derivation  below  shows  how  the  corrected  Gauss-Newton  method 
differs  from  the  earlier  version  based  on  the  QR  factorization. 

Because  of  (4.1),  J 1  J  can  be  written  as  VTtTVt,  so  that  (4.3)  is  equivalent  to 

TrTVrp  +  VT  Bp  =  -TTQrf.  (4.8) 

Using  p  =  Ypy  +  Zpz,  along  with 

VTr  =  ( It”W)\  and  VTZ=(,  0  V 

\  0  J  \In-fTa<it(J)  ) 


(4.8)  becomes 


t't  +  r*r  (W°w))  +  i'T  bf  =  -T'Q'f.  <«-»» 

If  we  let 

f  _  (Tu  ^12^ 

"  \t21  T22) 

be  a  partition  of  T,  where  T,  1  is  the  submatrix  consisting  of  the  first  k  rows  and  columns  of  T, 
then 

tTt_  ( (TiiTn  +  T^Jj,)  ( T, 2  +  T21 T22)\ 

v ar, ii  +  t22t21 )  (t?2t12  +  t22t22) ) 

and  (4.9)  can  be  split  into  two  equations  : 

(T^Tn  +  TlT2l)Py  +  (T^TU  +  T^T22)pz  +  YT Bp  =  -(T*  T?2  ) QTf,  (4.10) 


(T?2Tu  +  T?2T2l)py  +  (T?2Tl2  +  T?2T22)p,  +  Z1  Bp  —  -  (Tl2  Tj2 )  Qrf.  (4.11) 

As  in  the  earlier  version,  the  term  VT Bp  is  ignored  in  (4.10).  Moreover,  in  the  case  that  (4.1)  is 
the  singular-value  decomposition,  both  T12  and  Tti  vanish  and  the  two  equations  can  be  further 
simplified  to 

S?/>>  =  -  (  Sj  0  )Qt/,  (4.12) 


where 


S2pz  +  ZT Bp  =  -  (0  52)C?T/, 


•S  =  Tn  and  52  =  T22. 


(4.13) 


Note  that  S\  and  S2  are  diagonal  matrices,  and  that  the  />,  term  in  the  second  equation  could 
not  be  ignored  if  (4.1)  were  a  triangular  factorization  of  J,  because  then  (T^2TIX  +  T22T2l) 


couid  not  be  assumed  negligible  relative  to  {Ty^T^  +  T^T^)-  The  equations  that  are  ultimately 
solved  are 

5,Pv=-(W(J)  °)QT/>  (414) 

and 

(S*  +  ZrBZ)pt  =  -  ( 0  5j)Qt/  -  ZTBpy.  (4.15) 

The  matrix  5|  +  Zr BZ  is  replaced  by  a  modified  Cholesky  factorization  if  it  is  computationally 
singular  or  indefinite.  The  range-space  component  is  a  Gauss-Newton  search  direction,  wh' t,  in 
the  positive-definite  case,  the  null-space  component  is  a  projected  Newton  direction. 

When  no  modification  is  necessary,  the  subproblem  being  solved  is 

min  gTp  +  ]■  pT(  J  +  B)p  (416) 

re®"  2 

subject  to  Jp  ~  -/, 

where  is  taken  in  a  least-squares  sense  if  the  rows  of  J  are  linearly  dependent,  as  in  the  case 
when  777  >  77,  and  otherwise  as  equality.  Subproblem  (4.16)  is  an  equality  constrained  quadratic 
program.  When  rank(J)  =  gradc(J)  —  n,  its  solution  is  a  full-rank  Gauss-Newton  direction 
that  is  completely  determined  by  the  constraints  in  (4.16).  When  rank(J)  =  gradc(J)  <  n, 
the  search  direction  is  computed  as  the  sum  of  two  mutually  orthogonal  components,  defined  by 
equations  (5.3.14)  and  (5.3.15).  In  this  case  S2  =  0,  so  that  the  projected  Hessian  in  (5.3.15) 
is  ZT  BZ  and  therefore  involves  only  the  second  derivatives  of  the  residuals.  We  shall  return  to 
this  point  in  Section  7,  when  we  discuss  SQP  methods  for  nonlinear  least  squares. 

Although  the  range-space  component  solving  (4.14)  can  never  be  a  direction  of  increase 
for  /T/  (see  Fraley  [1987a]),  the  search  direction  computed  by  (4.14)  and  (4.15)  may  not  be  a 
descent  direction  for  /T/,  regardless  of  whether  or  not  Sf  -I-  ZT BZ  is  modifed,  on  account  of  the 
71,  term  in  (4.15).  Thus,  if  |co,s(j,p)|  is  smaller  than  some  prescribed  value,  or  if  grp  is  positive, 
then  a  modified  Newton  search  direction  (corresponding  to  the  case  gradc{J)  =  0)  is  used 
instead.  A  finite-difference  approximation  to  the  projected  matrix  ZT BZ  along  the  columns 
of  Z,  and  a  quasi-Newton  approximation  to  B  (see  the  discussion  in  Section  5)  are  given  as 
alternatives  to  handle  cases  in  which  second  derivatives  of  the  residual  functions  are  not  available 
or  are  difficult  to  compute.  Gill  and  Murray  test  their  method  on  a  set  of  twenty-three  problems, 
and  find  that  when  quasi-Newton  approximations  to  B  are  used,  the  algorithm  does  not  perform 
as  well  as  it  does  with  exact  second  derivatives  or  finite-difference  approximations  to  a  projection 
of  B.  They  observe  only  linear  convergence  for  the  quasi-Newton  version  on  problems  with  large 
residuals.  The  algorithms  are  implemented  in  the  NAG  Library  [1984]  in  the  NAG  Library  [1984] 
subroutine  E04HEF  uses  exact  second  derivatives,  while  subroutine  E04GBF  is  the  quasi-Newton 


version 
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5.  Special  Quasi- Newton  Methods 

Another  approach  to  the  nonlinear  least-squares  problem  is  a  based  on  a  quadratic  model 

9TP+  ^P7(jTj  +  B)p. 

where  B  involves  quasi-Newton  approximations  to  the  term 

m 

5(t)  =  £>,(/)  v2<Mr) 

i=i 

in  the  Hessian  of  the  nonlinear  least-squares  objective.  Brown  and  Dennis  [1971]  first  proposed 
a  method  in  which  the  Hessian  matrix  of  each  of  the  residuals  was  updated  separately.  This 
technique  is  impractical  because  it  entails  the  storage  of  ?n  symmetric  matrices  of  order  n,  and 
more  recent  research  has  aimed  to  approximate  B  as  a  sum. 

Dennis  [1973]  suggests  choosing  the  updates  to  satisfy  a  quasi-Newton  condition 

Bk+ifk  =  Vk  ~  ^7+iA  +  i5*--  (51) 


where 


.**  =  /Jt+i  -  Ik  and  Vk  =  9kJr\  -  gk  ■ 
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It  is  implied  that  the  update  can  then  be  chosen  as  in  the  unconstrained  case,  although  there 
is  some  ambiguity  as  to  how  this  should  be  done.  One  possibility  is  to  update  Bk  directly  to 
obtain  5*  +  i.  subject  to  a  quasi-Newton  condition  such  as  (5.1)  on  Bk+i*k-  Another  approach 
consistent  with  Dennis'  description  is  to  modify  77*  =  +  1  -f  Bk,  requiring  the  updated 

matrix  Hk  +  i  to  satisfy  a  quasi-Newton  condition 

ft  *+i5*  =  y*  •  (52) 

Then  Bk  +  i  =  //*  +  i  —  •7*T+I^*  +  1  is  the  new  approximation  to  B  at  r*  +  1.  Depending  on 
the  update  and  quasi-Newton  conditions,  the  two  alternatives  may  not  yield  the  same  result. 
Moreover,  updates  defined  by  minimizing  the  change  in  the  inverse  of  Bk,  such  as  the  BFGS 
update  to  Bk,  make  no  sense  in  this  context,  since  the  matrix  B  would  not,  by  itself,  be  expected 
to  be  invertible. 

Betts  [1976]  implements  a  linesearch  method  in  which  the  symmetric  rank-one  update  (see 
Dennis  and  More  [1977])  is  applied  to  B,  with  the  quasi-Newton  condition 

Bk  +  i-'k  =  Vk  ~  Jk-tk-'i'-  (5  3) 
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This  scheme  is  equivalent  to  applying  the  symmetric  rank-one  formula  to  the  matrix  7? = 
with  the  updated  matrix  H  k  +  \  satisfying  (5.2),  and  then  taking  Bk 41  =  /?*  +  i  -7^7, t. 
He  compares  this  algorithm  with  a  Gauss-Newton  method,  and  also  with  a  hybrid  algorithm 
that  starts  with  Gauss-Newton,  switching  to  the  augmented  Hessian  /?»  when  the  iterates  are 
judged  to  be  sufficiently  close  together  to  be  near  a  solution.  It  is  not  clear  whether  the  update 
is  performed  when  B  is  not  used  in  the  hybrid  method.  Betts  reports  observing  quadratic 
convergence  for  the  special  quasi-Newton  methods.  For  further  discussion  of  these  results,  see 
Section  2. 

Bartholomew-Biggs  [1977]  compares  the  PSB  update  (see  Dennis  and  More  [1977])  and  the 
symmetric  rank-one  update  applied  directly  to  5  in  1  linesearch  method.  These  updates  are 
tested  with  the  quasi-Newton  condition  (5.1),  as  well  as  with  the  condition 

Bk+ifik  =  7jir+j/jf  +  i  —  (5 -4) 

which  is  derived  from  the  relation 

m  tn 

W  =  ^<£.-(n  +  i)  [^>.(**  +  1)  -  V^t*)  +  P(IKII') 

1=1  1=1 

~  J*T+1/»+,  -  •**7*  41 

(see  also  Dennis  [1976])  Bartholomew-Biggs  points  out  that,  in  general,  quasi-Newton  ap¬ 
proximations  to  B  may  not  adequately  reflect  changes  that  are  due  to  the  contribution  of  the 
residuals.  For  example,  when  each  residual  function  is  quadratic,  and  consequently  each  V2d>, 
is  constant,  fli  +  i  may  differ  from  i?*.  by  a  matrix  of  rank  n.  For  this  reason,  he  does  some 
experiments  with  updating  tBi,  for  r  =  f?+1fk/fjfk,  which  is  the  appropriate  scaling  for  the 
special  case  in  which  /V  + 1  =  r/*  and  the  </>;  are  quadratic.  In  his  implementation,  a  Levenberg- 
Marquardt  step  is  used  whenever  the  linesearch  fails  to  produce  an  acceptable  reduction  in  the 
sum  of  squares  and  co.*(g,p)  >  — 10~4.  The  scaled  symmetric  rank-one  update  with  (5.4)  is 
selected  to  compare  with  other  methods  after  preliminary  tests,  because  it  exhibited  the  best 
overall  performance,  and  required  fewer  Levenberg-Marquardt  steps.  The  other  methods  tested 
include  a  Gauss-Newton  method,  a  method  that  combines  Gauss-Newton  with  a  Levenberg- 
Marquardt  method,  an  implementation  of  Fletcher's  [1971]  Levenberg-Marquardt  method,  and 
a  quasi-Newton  method  for  unconstrained  optimization.  All  of  the  fourteen  test  problems  have 
nonzero  residuals  Bartholomew-Biggs  finds  that  the  special  quasi-Newton  method  is  more  robust 
than  the  other  specialized  methods  for  nonlinear  least-squares,  and  that  it  is  particularly  suitable 
for  problems  with  large  residuals  He  also  observes  that  on  problems  on  which  the  Gauss-Newton 
and  Levenberg-Marquardt-based  methods  perform  poorly,  the  special  quasi-Newton  method  is 
more  effective  than  the  quasi-Newton  method  for  general  unconstrained  optimization.  Nothing 
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is  said  about  the  observed  rate  of  convergence  for  any  of  the  methods.  He  concludes  that  further 
research  is  needed  to  determine  the  best  updating  strategy,  some  desirable  features  being  hered¬ 
itary  positive  definiteness,  and  the  ability  to  update  a  factorization  of  B .  Finally,  he  indicates 
that  it  would  be  worthwhile  to  develop  a  hybrid  method  combining  Gauss-Newton  with  a  special 
quasi-Newton  method,  in  order  to  avoid  the  cost  of  the  updates  on  problems  that  are  easily 
solved  by  Gauss-Newton  methods. 

Gill  and  Murray  [1978]  discuss  a  linesearch  method  in  which  they  use  the  augmented  Gauss- 
Newton  quadratic  model  only  to  compute  a  component  of  the  search  direction  in  a  subspace 
that  approximates  the  null  space  of  the  Jacobian  (see  the  preceding  section).  They  apply  the 
BFGS  formula  for  unconstrained  optimization  (see  Dennis  and  More  [1977])  to  the  matrix  //*  — 
AT+r7*4i  +  5*  with  the  q  uasi-Newton  condition  (5.2),  and  then  form  B*  +  i  =  /K  +  i  -^a  +  i^a  +  i- 
The  choice  of  the  BFGS  update  is  based  on  performance  comparisons  to  a  number  of  other 
updates,  including  the  symmetric  rank-one  update  and  Davidon's  optimally-conditioned  update 
[Davidon  (1975)],  as  well  as  the  symmetric  rank-one  update  applied  to  //*  =  JjJk  +  B*  us*d 
in  Betts  [1976].  They  point  out  that,  if  JJ^+].7i  +  )  4-  Bk  is  positive  definite,  and  yj **  >  0,  then 
Jk  +  j  +  B*  + 1  is  also  positive  definite  with  this  scheme.  In  order  to  safeguard  the  method, 
the  projected  approximate  Hessian  is  replaced  by  a  modified  Cholesky  factorization  when  it  is 
singular  or  indefinite.  In  addition,  if  co.*(p,g)  exceeds  a  fixed  threshold  value,  a  modified  Newton 
step  with  the  full  augmented  approximate  Hessian  is  taken.  See  Section  4  for  a  summary  of  their 
observations  on  the  performance  of  the  methods. 

Dennis,  Gay,  and  Welsch  [1981a]  apply  a  scaled  DFF  update  (see  Dennis  and  More  [1977]) 


to  Bk  at  each  step. 

The  new  approximation  Bk+i  solves 

min\\H-l!7(TkBk-  B)H~'I-\\f 

(5.5) 

subject  to 

][ ,«*  =  jy*.  ;  77  positive  definite 

(5.6) 

Buy  =  AT+iA  +  i  -  -7*7*41  :  B  symmetric. 

(5.7) 

where 

r*  =  minfjiij**/.**  B***|,  1}. 

(5.8) 

The  scale  factor  r*  is  based  on  the  observation  that  the  quasi-Newton  approximation  to  B  is 
often  too  large  with  the  unsealed  update,  on  account  of  the  contribution  of  the  residuals.  The 
term  \yji>k /*J  Bk I  'n  rk  's  derived  from  the  self-scaling  principles  for  quasi-Newton  methods 
of  Oren  [1973],  and  attempts  to  shift  the  eigenvalues  of  the  approximation  Bk  to  overlap  with 
those  of  Rk,  using  new  curvature  information  at  z*.  This  method  forms  the  basis  for  the  .ACM 
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computer  program  NL2S0L  [Dennis,  Gay,  and  Welsch  (1981b)],  which  is  distributed  by  the  PORT 
Library  [1984]  as  subroutines  H2G  and  DN2G.  It  is  implemented  as  an  adaptive  method,  in  that 
Gauss-Newton  steps  are  taken  if  the  Gauss-Newton  quadratic  model  predicts  the  reduction  in 
the  function  better  than  the  quadratic  model  that  includes  the  term  involving  B.  A  trust-region 
strategy  is  used  to  enforce  global  convergence.  Numerical  results  are  given  in  Dennis,  Gay,  and 
Welsch  [1981a]  for  a  set  of  twenty-four  test  problems,  many  with  two  or  three  different  starting 
values,  medskip 

Al-Baali  and  Fletcher  [1985]  describe  some  linesearch  methods  that  are  similar  to  the  method 
of  Dennis,  Gay,  and  Welsch  [1981a]  discussed  above.  They  observe  that  the  DFP  update  defined 
by  (5.5)  -  (5.8)  is  equivalent  to  finding  //*  +  i  to  solve 


subject  to 


min  || H  lf‘(Jj+\Jk  +  i  +  +  rkBk  - 
h.h 


If  nk  =  yk  ;  H  positive  definite 
J 1*k  =  4T+iA  +  i  “  +  i  +  Ji  +  vh+v'*  ;  11  symmetric. 


(5-10) 


where 


rk  =  min{|yj5i/njB*n|,  1 }, 


and  then  forming 


R*  +  i  =  //i  +  i  -  1  Jk  + 1  - 


Moreover,  they  use  the  condition 


H *k  =  y*  :  H  positive  definite, 


(5.11) 


n  =  J*T+1  ^  +  .8*  +  AT+iA  +  !  -  Jkh- M  =  Vk  +  Oi IM2)  (5.12) 

as  an  alternative  to  (5.10).  and  mention  that  (5.10)  has  been  replaced  by  (5.11 )  in  newer  versions 
of  NL2S0L.  The  claim  is  that  the  updated  matrix  is  almost  always  positive  definite.  However,  if 
the  matrix  .7^+ 1 A +i  +  rkBk  is  not  positive  semi-definite,  rk  is  replaced  by  a  quantity  f*  that 
is  calculated  by  a  method  similar  to  a  Rayleigh  quotient  iteration,  so  that  Jk +  1Jt  +  1  +  hBk  is 
positive  semi-definite  and  singular.  A  corresponding  BFGS  method  is  also  given  in  which  the 
update  is  defined  by 

min  ||//~,/2((./*T+]  ^*  +  i  +  rkBk)~i  -  H  '  )//"1/2||f 

H  U 

instead  of  (5.H).  They  conclude  from  computational  tests  (described  in  Al-Baali  [1984])  that  their 
method  is  somewhat  more  efficient  in  terms  of  the  number  of  Jacobian  evaluations  than  KL2S0L, 
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but  requires  more  function  evaluations,  and  that  there  is  no  significant  difference  between  the 
DFP  and  RFC1S  updates  Al-Baali  and  Fletcher  also  introduce  scaling  factors  based  on  finding 
a  measure  of  the  error  in  the  inverse  Hessian.  They  observe  that,  for  the  BFflS  update  for 
unconstrained  optimization, 


II  V/2(//f 1  -  tfr+Ii)";,/2|lr  = 


where 


K  ,  u  x  _  {VkHk  Vk\  „  Uk**  ,  , 

±k(Hk:yk)=  M’-r —  ~2^rb —  +  1- 

V  !/*  •'*  J  *k  Hk*k 


(•VIS) 


Hence  an  "optimal"  value  of  r  can  be  found  by  minimizing  A  *  ( .7,T+ j  + ,  +  T&k)  **  *  function 
of  r.  Newton's  method  is  used  to  find  r,  an  iterative  process  that  requires  factorization  of 
^  +  for  each  intermediate  value  of  r.  They  were  apparently  unable  to  draw  any 

broad  conclusions  from  numerical  experiments  with  this  scaling,  and  refer  to  Al-Baali  [1984]  for 
details. 

A  convergence  analysis  for  minimization  algorithms  based  on  a  quadratic  model  in  which  part 
of  the  Hessian  is  computed  by  a  quasi-Newton  method  is  given  by  Dennis  and  Walker  [1981]  (see 
also  Chapter  11  of  Dennis  and  Schnabel  [1983]).  These  results  are  restricted  to  methods  that 
satisfy  a  least-change  condition  on  the  matrix  Bk  (analogous  to  the  FSB  and  DFP  updates). 
Only  a  fairly  mild  assumption  is  needed  to  prove  superlinear  convergence  to  an  isolated  local 
minimum  7*  :  that  the  vector  y®  in  the  quasi-Newton  condition 

Bk  *h  =  V  k 

be  chosen  so  that  the  norm  of  the  update  is 

C>(max{|j7*  -  7*||r  -  7*||r}), 

for  some  p  >  0.  This  assumption  is  satisfied  for  y £  in  each  quasi-Newton  codate  to  Bk  described 
above.  Their  treatment  of  inverse  updates  is  for  the  case  in  which  part  of  the  inverse  Hessian  is 
computed,  and  hence  does  not  apply  here.  To  the  best  of  our  knowledge,  no  convergence  results 
have  yet  been  proven  for  scaled  versions  of  the  updates,  or  for  updates  to  ,/j  +  1  Tl  +  1  +  Bk  that 
are  not  equivalent  to  some  direct  quasi-Newton  update  to  Bk 

6.  Conjugate-Gradient  Acceleration  of  Gauss-Newton  Methods 

Ruhe  [1979]  uses  preconditioned  conjugate  gradients  to  speed  up  convergence  of  Gauss- 
Newton  methods.  General  references  on  conjugate  gradients  include  Fletcher  [1980],  Chapter  4, 
and  Gill,  Murray,  and  Wright  [1981],  Chapter  4  We  give  a  brief  explanation  below 
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The  linear  conjugate  gradient  method  minimizes  an  7J-variate  quadratic  function 

0(/)  =  qJr+l  prHp, 


in  at  most  v  iterations  The  iteration  is 


where 


pk  =  ~9k  +  flk-iPk-\'. 


z*  +  i  =  *k  +  (tkPk 


_  IM,  „  Il»4l|g 

'~rJHr>-  '‘"ST 


9k  =  v  Q(rk)  =  q  +  H  7*  +  i- 

The  method  produces  a  sequence  of  search  directions  that  are  H -conjugate,  that  is 

PiHpj  —  0  if  i£j. 

The  number  of  iterations  needed  to  minimize  Q  by  conjugate  gradients  (with  exact  arithmetic) 
is  equal  to  the  number  of  distinct  eigenvalues  of  H .  The  idea  of  preconditioning  is  to  transform 
//  into  a  matrix  whose  eigenvalues  are  nearly  identical  in  magnitude.  If  a  positive-definite  matrix 
IT  is  used  as  a  preconditioner,  then  convergence  occurs  in  the  same  number  of  steps  that  wo  H 
be  taken  for  a  quadratic  function  with  the  Hessian  matrix 

W-'I’HW-*'2. 

The  ideal  preconditioner  would  be  IT  =  //,  but  since  conjugate  gradients  are  competitive  mainly 
when  7)  is  large,  an  approximation  that  is  relatively  inexpensive  to  factorize  is  used.  For  a 
smooth  nonlinear  function  T{i).  the  conjugate  gradient  method  (6.1)  can  also  be  applied,  with 
9k  =  )  and  ok  determined  by  a  linesearch,  with  safeguards  to  ensure  descent.  There  are 

several  possible  choices  for  ftk  that  are  equivalent  to  the  one  given  above  for  the  quadratic  case 
(see,  for  example,  Fletcher  [1981],  Chapter  4).  The  method  is  often  restarted  every  rt  iterations 
on  account  of  the  variation  in  V2T'(z )  for  non-quadratic  functions  (e.  g.,  Gill,  Murray,  and  Wright 
[1981],  Chapter  4).  Preconditioners  for  the  non-quadratic  case  attempt  to  approximate  V2^(z). 

In  Ruhe  s  algorithm,  the  matrix  ./TJ  is  used  as  the  preconditioner,  and  an  orthogonal  fac¬ 
torization  of  J  is  used  to  compute  the  necessary  quantities.  The  method  is  applied  to  problems 
in  which  the  residuals  are  nonzero  and  the  Jacobian  has  full  rank,  and  is  restarted  every  n  it¬ 
erations.  He  concludes  that  the  preconditioned  conjugate-gradient  method  never  increases  the 
total  number  of  iterations  required  to  solve  a  given  problem  relative  to  Gauss-Newton,  and  that 


i 


significant  improvements  in  the  speed  of  linear  convergence  of  Gauss-Newton  on  large-residual 
problems  can  be  achieved  with  conjugate-gradient  acceleration. 

Al-Baali  and  Fletcher  [1985]  point  out  that  conjugate-gradient  acceleration  of  the  type  de¬ 
scribed  by  Ruhe  is  equivalent  to  applying  a  RFGS  update  to  the  Gauss  Newton  approximate 
Hessian  JT,7  at  each  step.  They  implement  and  test  both  this  method  (without  restarts)  and  a 
scaled  version,  where  the  scale  parameter  r  is  chosen  to  minimize  Jk  ;jr*)  as  a  function 

of  r  (see  (5.-4.M)).  They  give  no  conclusions  as  to  the  relative  efficiency  of  the  scaled  and 
unsealed  versions  of  the  method,  but  find  that  the  modified  methods  offer  some  improvement 
over  Gauss-Newton,  while  exhibiting  the  same  difficulties. 


7.  Sequential  Quadratic  Programming  (SQP)  Methods 


Fraley  [1987a]  proposes  algorithms  that  solve  quadratic  programming  subproblems  whose 
formulation  is  based  on  convergence  properties  of  sequential  quadratic  programming  methods 
for  constrained  optimization,  and  on  geometric  considerations  in  nonlinear  least  squares.  The 
motivation  behind  these  methods  is  as  follows.  Recall  that  the  Hessian  matrix  of  the  least-squares 
objective  can  be  separated  into  the  sum  of  two  components  involving  different  types  of  derivative 
information  : 

V2(j/T/)  =JTJ  +  B' 

where 

?n 


>=i 


The  corrected  Gauss-Newton  methods  (Section  4)  calculate  a  search  direction  that  is  separated 
into  two  orthogonal  components  when  0  <  gradr(J)  <  n,  and  can  be  viewed  as  SQP  methods. 
When  gT<idr{J)  —  rank(J)  <  v,  the  contributions  of  JTJ  and  of  B  (or  of  an  approximation  to 
B )  are  essentially  decoupled  because  the  contribution  of  .7T7  in  the  projected  Hessian  is  zero.  No 
such  separation  is  possible  when  ravk(J)  =  rr.  In  any  case,  gradc(J)  <  *?  may  be  selected  based 
on  the  progress  of  the  minimization  as  well  as  the  singular  values  of  .7,  so  that  partial  separation  of 
./T  J  and  B  may  occur  between  the  extremes  of  Gauss-Newton  ( gradr{J )  =  rank(J)),  and  a  full 
Newton-type  method  (gradr(J)  =  0).  The  strategy  of  making  a  quasi-Newton  approximation 
to  B  which  is  then  added  to  J1  J  in  a  full  Newton-type  method  has  not  been  successful  outside 
a  neighborhood  of  the  solution,  unless  it  is  combined  with  other  techniques  (see  Section  5). 
The  approach  taken  in  Fraley  [1987a]  is  to  use  a  quasi-Newton  approximation  to  the  full  Hessian, 
while  separating  out  some  of  the  contribution  to  the  curvature  due  to  .7T7  by  including  first-order 
information  about  the  residuals  as  constraints. 
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A  search  direction  is  computed  as  the  solution  to  a  quadratic  program  (QP)  of  the  form 


j 


! 


) 

I 

s 


min  $Tp  +  l~PTHp 
r€S"  2 


(7.1) 


subject  to 


-bL  <  Ap  +  c  <  b1’ 

where 

b'  >  0  and  bv  >  0. 

In  SQr  methods  for  constrained  optimization,  //  approximates  the  Hessian  of  a  Lagrangian 
funtcion  in  order  to  take  into  account  the  curvature  of  the  constraints  that  are  active  at  the 
solution  (e.  g.,  Powell  [1983],  Gill  et  al.  [1985b,  1986b],  Nocedal  and  Overton  [1985],  Stoer 
[1985],  and  Gurwitz  [1986]).  For  nonlinear  least  squares,  it  suffices  for  H  to  approximate  the 
Hessian  matrix  of  \  JT f  even  if  some  of  the  contraints  in  (7.1)  are  active  at  a  solution 
because  g{r")  =  0.  These  methods  have  the  potential  to  converge  faster  than  quasi-Newton 
methods  for  unconstrained  optimization,  since  only  the  projection  of  the  Hessian  in  the  null  space 
of  the  active  QP  constraint  normals  —  rather  than  the  full  Hessian  —  need  be  positive  definite 
as  a  condition  for  superlinear  convergence. 

Two  classes  of  suitable  QP  constraints  for  (7.1 )  are  described  .  constraints  on  the  directional 
atives  of  individual  residuals,  and  constraints  based  the  QR  factorization  of  J  A  departure 
m  other  algorithms  is  that  information  about  the  residuals,  and  interrelationships  between 
residuals,  can  be  used  to  construct  the  subproblems  (the  algorithm  of  Davidon  [1976]  is  an 
exception  —  see  Section  11).  In  the  SQP  algorithms,  a  set  C  of  desirable  constrains  is  chosen 
first,  which  may  be  infeasible  or  may  otherwise  exclude  all  suitable  search  directions.  For  example, 
such  a  set  of  constraints  is 


{^<i,Jp  =  <f>,\  »  =  1,2 . trr).  (7.2) 

Any  p  satisfying  Y<f>Jp  =  -if>,  is  a  descent  direction  for  <£,■  if  <*,■  ^  0  and  is  otherwise  orthogonal 
to  V<£,.  The  unconstrained  minimum  pQN  of  the  QP  objective  in  (7.1)  is  a  descent  direction 
for  the  nonlinear  least-squares  objective  provided  H  is  positive  definite.  Therefore,  as  long  as 
Pe,N  is  considered  satisfactory,  an  acceptible  search  direction  will  eventually  be  obtained  by  either 
removing  some  constraints  from  C,  or  else  by  perturbing  the  constraints  in  C  so  as  to  enlarge  the 
feasible  region.  Based  on  this  reasoning,  she  proposes  two  different  strategies  (which  could  also 
be  combined) 
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One  strategy  uses  a  QP  to  select  a  subset  of  constraints  in  C  as  the  feasible  region  for  (7.1). 
Several  quadratic  programs  may  be  solved  within  a  single  iteration  in  order  to  compute  a  search 
direction,  which  is  justified  for  two  reasons.  First,  starting  the  solution  process  for  a  QP  with 
information  about  the  solution  of  a  related  subproblem  can  often  lead  to  significant  savings  in 
QP  iterations  (see,  e.  g  ,  Gill  et  al.  (1985a)).  Also,  when  the  cost  of  a  function  evaluation  is 
much  greater  than  the  cost  of  a  QP  iteration,  the  effort  involved  in  obtaining  the  search  direction 
by  solving  more  than  one  subproblem  may  be  worthwhile  if  it  results  in  a  substantial  reduction 
in  the  number  of  outer  iterations. 

It  is  difficult  to  automate  the  selection  of  QP  constraints,  and  the  evaluation  of  the  current 
QP  solution  as  a  candidate  for  the  search  direction.  One  strategy  considers  each  of  the  constraints 
in  (7.2)  separately  in  order  of  decreasing  residual  size,  with  the  object  of  including  as  many  of  the 
constraints  as  possible.  A  constraint  is  added  to  the  current  constraint  set  (initially  empty)  if  the 
corresponding  QP  computes  an  "acceptable"  search  direction  p.  In  addition  to  the  requirement 
that  gTp  <  0,  Fraley  uses  a  lower  bound  on  the  magnitude  of  p,  and  an  upper  bound  on 
|cos(<7.  ;1)|,  as  the  criteria  for  accepting  f>.  Some  other  examples  that  use  constraints  based  on 
the  QR  factorization  are  very  similar  to  corrected  Gauss-Newton  methods  (Section  4). 

In  the  second  approach,  constraints  in  C  are  modified  in  order  to  obtain  a  suitable  feasible 
region  This  is  accomplished  by  treating  constraint  bounds  as  variables  in  a  QP.  Using  the 
constraint  set  (7.2).  Fraley  shows  how  these  SQP  algorithms  are  related  to  Gauss-Newton  and 
Levenberg-Marquardt  methods.  The  QP 

min  ftTft 

b.p 


subject  to 


-ft  <  Jp  +  /  <  6 
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b  >  0. 

computes  the  smallest  possible  perturbation  that  allows  all  of  the  (7.2)  to  intersect.  In  the  solution 
(ft ;  p)  to  (7.3),  the  vector  p  is  a  Gauss-Newton  search  direction.  When  J  is  ill-conditioned,  it  is 
possible  that  the  constraints  in  (7.2)  do  intersect  (ft  =  0),  but  that  the  intersection  occurs  at  a 
vector  p  that  is  very  large  in  magnitude.  For  u>  >  0,  the  QP 

min  ftTft  +  u ;pTp 

b.p 


subject  to 


-ft  <  Jp  +  f  <  ft 


? 
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forces  ||/>||  to  increase  when  ||j)||  would  otherwise  be  large.  In  the  solution  (6:  />)  to  (7.4), 
the  vector  p  is  a  Levenberg-Marquardt  search  direction.  In  an  SQP  algorithm  based  on  (7.3) 
(respectively,  (7.4))  there  is  the  option  of  using  p  ( f> )  as  a  search  direction,  or  of  using  6  (6)  to 
define  bounds  for  a  second  QP  of  the  form  (7.1),  from  which  the  search  direction  is  computed. 

Fraley  proposes  a  number  of  variations  of  these  basic  SQP  algorithms  and  tests  some  of  them 
on  a  set  of  fourteen  problems.  She  uses  the  BFGS  method  to  approximate  H  in  (7.1)  just  as 
in  unconstrained  optimization,  and  observes  that  the  approximation  retains  positive  definiteness 
throughout  She  finds  the  SQP  methods  work  well  on  some  problems,  and  poorly  on  some  others, 
so  that  it  is  not  possible  to  say  anything  conclusive  about  their  performance  relative  to  existing 
methods. 

8.  Continuation  Methods 

Continuation  methods  have  also  been  applied  to  nonlinear  least-squares  problems.  These 
methods  solve  a  sequence  of  parameterized  subproblems 

min  $(t;  ry);  i  =  l,2,...,rm«,  (8.1) 


where 


0  =  r0  <  7-j  <  .  .  .  <  Tim„  =  1 


arg  min  <f>(7;0)  =  z0  and  arg  min  $(*;  1)  =  xm. 

The  idea  is  that  methods  that  have  fast  local  convergence,  but  may  not  be  robust  in  a  global 
sense,  can  be  applied  to  solve  each  subproblem  in  relatively  few  steps,  because  information  from 
the  solution  of  previous  subproblems  may  be  used  to  predict  a  good  starting  value  for  the  next 


DeVilliers  and  Glasser  [1981]  define 

<Hx;r)=I||/(x)||^+i(r*-l)||/(To)||^  (S.2) 

where  £  is  a  positive  integer,  with  a  fixed  spacing  between  the  parameters  r,  in  (8.1).  They 
test  two  different  continuation  methods,  one  that  uses  Newton's  method  (with  linesearch)  to 
solve  the  intermediate  problems,  and  one  that  uses  a  Gauss-Newton  method  (with  linesearch). 
An  unspecified  "device"  is  included  in  the  implementation  of  both  minimization  techniques  to 
ensure  a  decrease  in  the  objective  at  every  iteration  The  continuation  methods  are  compared  with 


results  obtained  by  applying  both  minimization  algorithms  to  the  original  problem.  Intermediate 
subproblems  are  not  solved  exactly  ;the  criterion 


where  r,  =  10"?  if  i  <  and  < tm„  =  10"6.  is  used  to  dete'mine  convergence  of  a  subptoblem. 

Numerical  experiments  are  carried  out  on  three  different  test  problems,  with  multiple  starting 
values,  most  of  which  are  points  of  failure  for  both  Newton's  method  and  Gauss-Newton.  They 
conclude  that,  although  the  continuation  method  is  less  efficient  than  the  underlying  method 
when  both  are  successful,  it  will  converge  on  many  problems  for  which  the  underlying  method 
fails  when  used  alone.  However,  the  results  they  present  are  for  different  values  of  the  step 
size,  and  the  exponent  k,  and  no  mechanism  is  given  for  the  automatic  choice  of  either  of  the 
parameters.  DeVilliers  and  Glasser  point  out  that  their  methods  may  require  modification  if 
the  optimization  method  that  is  used  to  solve  the  subproblems  encounters  difficulties,  or  if  the 
continuation  path  is  not  well-behaved.  Fraley  [l 987a ,  1988]  observes  that  the  first  two  test 
problems  of  DeVilliers  and  Glasser  are  very  sensitive  to  the  choice  of  the  maximum  step  bound, 
or  the  initial  trust-region  size  for  most  methods  and  that  the  methods  can  be  quite  efficient 
provided  an  appropriate  non-default  choice  is  made  for  these  parameters. 

Sa/ane  [1987]  incorporates  a  trust-region  strategy  into  a  continuation  method  by  defining 

r)  =  \  (||/(*)||2  4-  (r  -  1  )||/(xo)||j  +  A(r  -  l)IID(r  -  x0)||22)  ,  (8.3) 

and  then  applying  Gauss-Newton  to  this  function  for  the  inner  iterations.  Instead  of  allowing 
the  continuation  parameter  r  to  range  from  0  to  1,  he  advocates  stopping  when  it  becomes 
inefficient  to  solve  the  subproblems,  and  then  restarting  the  method  after  replacing  x0  by  the 
new  iterate.  He  points  out  that  his  approach  is  especially  suitable  for  large-residual  problems, 
because  it  transforms  the  original  problem  into  a  sequence  of  subproblems  with  small  residuals 
The  idea  is  to  attempt  to  determine  when  the  neglected  terms  become  significant,  and  then  pose 
a  new  subproblem.  An  initial  value,  rj,  of  the  continuation  parameter  must  be  supplied  by  the 
user  in  order  to  start  the  method.  Should  any  step  fail  to  obtain  a  decrease  in  either  the  nonlinear 
least-squares  objective  or  its  gradient,  rj  is  decreased,  and  the  calculation  is  repeated  without 
changing  x0.  Theorems  on  descent  conditions  and  convergence  are  presented.  Salane  argues 
that  his  continuation  method  allows  direct  selection  of  the  Levenberg-Marquardt  parameter  A 
in  (8.3),  because  A  may  be  chosen  so  that  the  term  A(1  -  t)DtD  behaves  somewhat  like  the 
second-order  terms  that  have  been  neglecled  in  the  Hessian  of  $(x;t).  However,  no  mechanism 
is  suggested  for  automatic  choice  of  A,  and  A  =  ||/(x0)|!2  IS  usec*  10  t*ie  <ests 
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Salane  gives  lest  results  for  a  version  of  his  algorithm  on  a  set  of  nine  problems  (all  of  which 
are  included  in  our  set).  A  comparison  is  made  to  results  obtained  from  MINPACK,  and  also  to 
the  results  reported  by  DeVilliers  and  Glasser  [1981]  for  two  of  the  test  problems.  He  concludes 
that  the  performance  of  the  method  compares  favorably  with  that  of  MINPACK,  and  is  superior 
to  the  DeVilliers  and  Glasser  continuation  method  on  the  relevant  problems.  The  matrix  D  in 
(8.3)  is  taken  to  be  the  identity  matrix  throughout  the  tests,  and  for  one  test  problem  a  type  of 
variable  scaling  is  used  No  information  is  given  concerning  scaling  for  the  MINPACK  tests.  The 
results  that  are  presented  correspond  to  several  different  values  of  rt,  although  the  criterion  used 
in  choosing  this  value  is  not  given.  Test  results  in  which  the  value  of  is  varied  ate  included  for 
three  of  the  problems  for  the  purpose  of  showing  that  performance  is  sensitive  to  the  specification 
of  the  continuation  parameter. 

9.  Modifications  of  Unconstrained  Optimization  Methods 

Besides  Gauss-Newton  methods,  several  straightforward  modifications  of  unconstrained  op¬ 
timization  methods  are  possible  for  nonlinear  least  squares  In  quasi-Newton  methods,  Jq  J0  can 
be  used  as  the  initial  approximation  to  the  Hessian  matrix.  Ramsin  and  Wedin  [1977]  report 
favorable  results  with  this  technique.  We  note  that  a  perturbed  matrix  JjJo  can  be  used  as  the 
initial  approximate  Hessian,  where  J 0  is  a  modified  Cholesky  factor  of  Jj J0  (Gill  and  Murray 
[1974])  ,  in  order  to  maintain  positive  definiteness  when  J0  is  ill-conditioned. 

Wedin  [1974]  (see  also  Ramsin  and  Wedin  [1977])  suggests  a  modification  of  Newton’s 
method  in  which  the  search  direction  is  defined  by 

in 

(.7TJ  +  ^^VV,)p  = -J7,  (»•!) 

i  =  i 

where  •f'i  is  the  rth  component  of  the  projection  /  of  /  onto  7?(.7).  This  iteration  approaches 
Newton's  method  in  the  limit,  since  f(r’)  =  f(r'),  and  is  parameter-independent,  in  the  sense 
that  minimization  of  /  as  a  function  of  7  is  equivalent  to  minimization  of  /  as  a  function  of  a  new 
variable  :  —  provided  the  mapping  that  defines  7  as  a  function  of  z  has  a  nonsingular  Jacobian. 
An  obvious  difficulty  is  that  /,  and  hence  (9.1),  is  not  well-defined  when  J  is  ill-conditioned. 

Recall  that  in  quasi-Newton  methods  for  unconstrained  optimization,  the  approximate  Hesian 
matrix  is  required  to  satisfy  the  condition 


where 


Hk«k  =  Vi" 


fk  ~  7jt  +  i  -  7*  and  jr*  =  0*41  -  0* 
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(e  g  Dennis  and  More  [1977]).  Al-Baali  and  Fletcher  [1985]  suggest  the  use  of  yk  defined  by 
(5.12)  rather  than  j/j,  =  gk^.\ -gk  in  the  quasi-Newton  condition  (9.2)  They  report  improvements 
with  the  FIFGS  and  DTP  formulas  when  this  substitution  is  made.  However,  they  remark  that 
the  condition  >  0  for  hereditary  positive  definiteness  of  the  updates  is  not  guaranteed  by  the 
linesearch  requirements,  and  they  replace  yjsk  in  the  update  formulas  by  max  {yJn/t.O.OlyJsjt} 
as  a  safeguard.  They  do  not  consider  this  a  major  drawback,  because  pjfk  >  0.01  j/J**  almost 
always  occurred  in  their  examples.  A  somewhat  different  safeguard  is  used  in  a  later  related 
paper  [see  Fletcher  and  Xu  (1986),  p.  26]  discussed  below. 

Al-Baali  and  Fletcher  [1985]  also  develop  several  hybrid  linesearch  methods  in  which  the 
models  are  assessed  in  terms  of  the  function  A*.  defined  by  (5.13),  an  approximate  measure  of 
the  error  in  the  inverse  Hessian.  In  one  class  of  methods,  the  modified  BFGS  update  described 
in  the  preceding  paragraph  is  applied  to  a  matrix  of  the  form 


//*  +  !  —  (1  ~  @k  )Hk  +  ^k  Tk  *4J'+1  A4I ' 

where  rk  minimizes  Ak{rkJ^l.Jk_kl\yk),  and  Ok  is  chosen  to  minimize  A* ( //jt  +  i ;  pk ),  in  order 
to  obtain  the  new  approximate  Hessian.  In  their  implementation,  in  which  0k  is  restricted  to  be 
either  0  or  1,  they  find  that  the  method  has  difficulties  on  singular  problems,  and  that  the  scaling 
of  the  search  direction  often  does  not  allow  a  =  1  as  a  trial  step  in  the  linesearch.  They  refer  to 
Al-Baali  [1984]  for  more  details  of  the  tests. 

Another  class  of  hybrid  methods  defined  by  Al-Baali  and  Fletcher  compares  the  value 

Aqn  =  A  k(]Jk'pk) 

for  the  current  quasi-Newton  approximation  Hk  with 

A  as  =  At(  Jk  +  l'  Vk) 


for  the  Gauss-Newton  approximation.  The  basic  algorithm  can  be  summarized  as  follows  : 
if  <  Ac„  then  use  the  modified  BFGS  search  direction 
else  use  the  Gauss-Newton  search  direction 


(9.3) 


They  test  several  versions  of  this  method  that  differ  in  the  action  taken  whenever  a  switch  from 
Gauss-Newton  to  quasi-Newton  takes  place,  in  one,  Hk  + 1  is  reset  to  A  +  i'  while  in  another 
Hk  +  \  is  reset  to  the  result  of  applying  the  modified  BFGS  update  to  +  1  (conjugate- 

gradient  acceleration).  They  observe  little  difference  in  performance  between  these  two  alterna¬ 
tives,  and  find  them  to  be  the  best  of  the  many  methods  for  nonlinear  least  squares  treated  in 
their  study.  A  version  of  the  first  strategy  that  substitutes  the  quantity  minr  Ai,(r.7tT+1.7jt  +  1 ;  yk ) 
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for  Agk  in  th«  comparison  with  AQN  is  also  tried,  but  it  is  found  to  have  some  difficulties 
on  a  problem  for  which  the  Jacobian  is  singular  at  the  solution.  A  final  variant  maintains  the 
quasi-Newton  update  throughout,  and  never  resets  the  approximate  Hessian.  They  find  that  this 
method  is  not  as  efficient  as  the  others  on  some  types  of  large-residual  problem. 


Fletcher  and  Xu  [1986]  give  an  example  in  which  the  hybrid  method  (9  3)  has  a  linear  rate 
of  convergence  when  the  BFGS  method  would  converge  superlinearly.  The  difficulty  is  that  the 
comparison  between  and  AON  may  fail  to  distinguish  between  zero-residual  problems  and 
those  with  nonzero  residuals.  They  propose  two  new  hybrid  algorithms  and  show  them  to  be 
superlinearly  convergent.  The  first  algorithm  computes  the  modified  BFGS  search  direction  if 

ll/(^)ll2-H/(n  +  1)||2  _ 

- FT77 — ui -  <  a'  (9-4) 

11/(7*  )l  l2 

for  some  fixed  a  £  (0,1),  and  a  Gauss-Newton  step  otherwise.  The  method  is  motivated  by  the 
following  relationship 


ll/(**Nl2-||/(**  +  l)|l2_  f0, 

ll/(n)|l2  l  1, 

The  second  algorithm  computes  a  modified  BFGS  step  if 


K  11/(01)2*0; 
if||/(r*)||2  =  0. 


11/(3-*)  ~  /(»*  +  !  )ll; 


<  tr  and 
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ll/(n)||2 

where  both  rr  and  7  are  fixed  parameters  in  (0,1),  and  a  Gauss-Newton  step  otherwise.  The 
additional  condition  for  choosing  the  BFGS  search  direction  is  derived  from  another  asymptotic 
relationship 


t-  +  y*)  _  f  0,  if  ||/(x  )||2  —  0; 

A  “tl.  if  11/(3-*  )ll2*o. 

Numerical  results  ate  given  for  set  of  fifty-six  test  problems,  a  few  with  multiple  starting  values. 
They  conclude  that  the  new  methods  offer  some  overall  improvement  over  those  based  on  (9.3), 
but  that  there  is  no  reason  to  prefer  the  more  complicated  test  (9.5)  over  (9.4). 


10.  Special  Linesearches 

Lindstrom  and  Wedin  [1984]  and  Al-8aali  and  Fletcher  [1986]  propose  specialized  linesearch 
methods  for  nonlinear  least-squares  problems  in  which  each  residual  is  interpolated  by  a  quadratic 
function,  in  contrast  to  the  strategy  of  interpolating  to  the  sum  of  squares  used  in  conventional 
linesearches  for  unconstrained  minimization.  As  a  result  a  quartic  polynomial,  rather  than  a 
simpler  cubic  or  quadratic,  is  minimized  at  each  iteration  of  the  linesearch 
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Lindstrom  and  Wedin  substitute  their  linesearch,  which  uses  only  function  values,  for  the 
quadratic  interpolation  and  cubic  interpolation  routines  in  the  NAG  Library  (1980  version)  non¬ 
linear  least-squares  algorithm  E04GBF  (see  Sections  4  and  5),  and  compare  the  performance  with 
the  NAG  linesearch  routines  on  a  set  of  eighteen  test  problems.  They  find  that  no  linesearch 
algorithm  is  superior  over  all,  but  that  their  algorithm  makes  a  better  initial  prediction  to  the 
steplength  that  minimizes  the  sum  of  squares  along  the  search  direction.  In  a  second  set  of  tests 
that  includes  multiple  starting  values  for  many  of  the  test  problems,  they  add  a  modified  version 
of  their  linesearch  algorithm  that  reverts  to  a  simple  backtracking  strategy  if  an  acceptable  de¬ 
crease  in  the  sum  of  squares  is  not  obtained  after  two  function  evaluations.  They  observe  that 
their  modified  method  requires  fewer  function  evaluations  than  either  of  the  NAG  linesearch  rou¬ 
tines,  and  that  the  total  for  their  original  method  falls  between  cubic  interpolation  and  quadratic 
interpolation  to  the  sum  of  squares.  They  note  occasional  inefficiencies  in  their  methods  due  to 
extrapolation,  but  comment  that  such  effects  are  more  pronounced  for  quadratic  interpolation  of 
the  sum  of  squares. 

Al-Baali  and  Fletcher  [1986]  test  similar  linesearch  methods  that  use  gradients  on  a  set  of 
fifty- five  test  problems  with  a  number  of  nonlinear  least-squares  algorithms  described  in  Al-Baali 
[1984]  (see  also  Al-Baali  and  Fletcher  [1985]).  They  conclude  that  considerable  overall  savings 
can  be  made  by  interpolating  to  each  of  the  residuals  rather  to  than  the  sum  of  squares.  They 
also  obtain  favorable  results  for  two  different  schemes  designed  to  save  Jacobian  evaluations  in 
the  new  linesearch. 

11.  Methods  for  Special  Problem  Classes 

Algorithms  have  also  been  formulated  to  treat  some  special  cases  of  the  nonlinear  least- 
squares  problem.  For  example,  there  is  a  vast  literature  concerning  methods  specific  to  nonlinear 
equations  that  we  shall  make  no  attempt  to  survey  here. 

In  some  nonlinear  least-squares  problems,  the  vector  7  can  be  separated  into  two  sets  of 
variables,  say 

where  it  is  relatively  easy  to  minimize  the  sum  of  squares  as  a  function  of  y  alone.  A  fairly 
common  situation  of  this  type  is  one  in  which  y  is  the  set  of  variables  that  occur  linearly  in  all 
of  the  residuals,  so  that 

r  II' (OIL 

is  a  linear  least-squares  problem.  For  example,  exponential  fitting  problems  (see  Varah  [1985]) 
fall  into  this  category.  Methods  that  deal  with  separable  nonlinear  least-squares  problems  were 


ito  w.’^iv.’v.vjy.  vj  wvw  wrarwruntrim;.F>jr«:*  v  wv.rw  wv^:'<; . 


introduced  by  Golub  and  Pereyra  [1973],  Ruhe  and  Wedin  [1980]  survey  these  methods  and 
give  some  extensions.  They  describe  three  basic  algorithms,  all  of  which  use  Gauss-Newton  to 
minimize  the  sum  of  squares  as  a  function  of  y  The  methods  differ  in  the  definition  of  the 
quadratic  model  function  for  minimization  with  respect  to  c.  The  Jacobian  and  Hessian  of  the 
nonlinear  least-squares  objective  can  be  partitioned  as  follows: 


so  that 
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The  approximate  Hessians  that  are  considered  for  the  minimization  as  a  function  of  z  are 
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and 


(11.3) 


Algorithms  based  on  (11.1)  and  (11.2)  are  shown  to  converge  at  a  faster  rate  than  the  conven¬ 
tional  Gauss-Newton  method,  while  the  asymptotic  convergence  rate  for  (11.3)  may  be  much 
slower.  On  the  other  hand,  of  the  three  quadratic  models,  it  is  least  expensive  to  compute  so¬ 
lutions  with  the  approximate  Hessian  (11 .3),  and  most  expensive  to  compute  them  from  (1 1.1). 
Use  of  (112)  costs  about  the  same  as  a  conventional  Gauss-Newton  method.  Tests  on  four 
sample  problems  are  given  to  illustrate  rates  of  convergence. 

Davidon  [1976]  introduces  a  quasi-Newton  method  for  problems  in  which  (»)  m  >  n,  (it) 
location  of  the  minimum  is  not  very  sensitive  to  weighting  of  the  residuals,  and  (tit)  rapid 
approach  to  a  minimum  is  more  important  than  convergence  to  it.  A  new  estimate  of  the 
minimum  is  computed  after  each  individual  residual  and  its  gradient  are  evaluated,  rather  than 
after  evaluating  the  entire  block  of  m  residuals.  Davidon  gives  an  analogy  to  time-dependent 
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measurements  of  experimental  data,  in  which  quantities  calculated  from  the  measurements  are 
updated  each  time  a  new  observation  is  made.  Starting  from  an  initial  quadratic  approximation 

9o(-r)  =  /(3-o)T/(*o)  +  (<r  -  rof^o  V  -  ?o). 

with  7/0  positive-definite,  the  algorithm  that  determines  the  next  iterate  is  equivalent  to  mini¬ 
mizing  a  quadratic  function  of  the  form 

(r  -  n)T^f,(n)]2  +  A 

where  A*  is  in  (0,  l].  It  is  suggested  that  the  choice  of  {A*}  should  be  problem-dependent,  and 
some  alternatives  are  proposed.  Davidon  tests  the  method  on  a  set  of  four  problems  in  which  he 
varies  the  size  of  the  problem,  the  initial  estimate  of  the  solution,  and  the  sequence  {A*}.  He 
observes  that  the  method  tends  to  oscillate  about  a  minimum  rather  than  converging  to  it,  but 
that  it  often  reduces  the  sum  of  squares  more  rapidly  than  other  methods. 

Further  computational  experiments  with  Davidon's  method  are  reported  in  Cornwell,  Koc- 
man,  and  Prosser  [1980].  On  a  set  of  fifteen  zero-residual  problems,  they  test  the  method  with 
various  fixed  values  of  A*.  They  obtain  overflow  in  most  cases  for  small  values,  but  otherwise  find 
that  the  efficiency  of  the  method  decreases  as  A*  is  increased.  In  one  case,  the  method  cycled 
through  a  sequence  of  points  that  was  not  near-optimal.  On  the  basis  of  these  observations, 
they  implement  a  new  version  that  attempts  to  use  a  fixed,  relatively  small  value  of  A*. ,  restart¬ 
ing  from  the  initial  vector  with  a  larger  value  if  it  is  determined  that  overflow  would  otherwise 
occur.  They  find  that  this  modified  implementation  of  Davidon's  method  is  competitive  with 
the  computer  program  LMCHQL  from  Argonne  National  Laboratory  based  on  Fletcher  s  [1971] 
Levenberg-Marquardt  algorithm  (which  has  since  been  superseded  by  the  M1NPACK  routine 
LMDER  [More,  Garbow,  and  Hillstrom  (1980)]). 
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